Byron,

I have been using a 2.4.x kernel on dual Xeon boxes, and the performance seems
good. I am curious about 2.6.5 kernel stability and performance. Any light
you can shed will be helpful. I am looking for hard numbers; your Nutch
stats will do.


Here are the issues I found with Nutch last year:

- There is a futex() bug in the backported NPTL, so you need to set
LD_ASSUME_KERNEL=2.4.1 in the environment before starting Java. Otherwise
Java will hang after about a day of running.
- Java 1.4.2 has a serious bug in decoding a Japanese character set, so the
NekoHTML parser goes into an infinite loop on some Japanese pages. This is
fixed in Java 1.5. I use Java 1.4.1.
- NekoHTML 0.9.2 fixes several bugs, so use that version. I have run it on
over 100 million documents. I still see the following per million documents
indexed:
    30 DOM Exceptions
    1600 Bad content length

- When running on multiple nodes, I do not see where in the Nutch code the
document frequency is normalized for IDF calculations. I do not see the
classic two-pass approach. If someone can show me where this is done, it will
be helpful.
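
To be concrete about the last item: by "classic two pass" I mean that pass one
sums the document counts and per-term document frequencies from every node,
and pass two computes IDF from those global totals so that all nodes score
with the same weights. Here is a minimal sketch of that idea. It is not Nutch
or Lucene code. The per-node file layout and the function name global_idf are
made up for illustration, and it uses the textbook log(N/df) rather than
Lucene's exact formula.

```shell
#!/bin/sh
# Hypothetical input: one plain-text file per node.
#   first line:      N <number of documents on this node>
#   remaining lines: <term> <document frequency on this node>
# This layout is an assumption for illustration, not a Nutch format.
global_idf() {
  awk '
    # Pass 1a: sum the per-node document counts into a global total.
    FNR == 1 && $1 == "N" { total += $2; next }
    # Pass 1b: sum the per-node document frequency of each term.
    NF == 2 && $2 > 0 { df[$1] += $2 }
    END {
      # Pass 2: compute IDF from the global totals, textbook log(N / df).
      for (t in df) printf "%s %d %.4f\n", t, df[t], log(total / df[t])
    }
  ' "$@" | sort
}
```

Calling global_idf node1.txt node2.txt prints one line per term with its
global document frequency and IDF. Without the merge, each node would compute
IDF from its local counts only, so the same term could get different weights
on different nodes.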

I used Nutch as an uber example for using Lucene. The code base was very
helpful.


Ram

On 5/7/04 6:32 AM, "Byron Miller" <[EMAIL PROTECTED]> wrote:

> Massimo,
> 
> I have found that I had horrible performance with
> RedHat 9.0 and kernel 2.4.x on the Xeon machines.
> 
> I did a yum update to Fedora Core 2 rc3 on my servers
> and kernel 2.6.5, and that made a world of difference
> in stability, speed and performance. On the Xeons
> you sometimes have to disable HT, ACPI and other
> features to get things to stabilize.
> 
> The process for me to build the index was this guide
> (over and over and over)
> 
> http://www.nutch.org/docs/en/tutorial.html
> 
> I did the entire dmoz (not a subset) and I only ran
> the link analysis as 1 iteration (a couple of times in a
> row), and when I did new segments I did about 6
> million at a time.
> 
> bin/nutch generate db segments -topN 6000000
> s2=`ls -d segments/2* | tail -1`
> echo $s2
> 
> bin/nutch fetch $s2
> 
> bin/nutch updatedb db $s2
> 
> bin/nutch analyze db 1
> bin/nutch analyze db 1
> 
> To be truthful, I am interested in the distributed
> webdb myself, so as I grow over 300+ million I can share
> the load of analyzing and such.
> 
> -byron
> 
> 
> --- Massimo Miccoli <[EMAIL PROTECTED]> wrote:
> 
> Hi,
> I use Red Hat 9 with kernel 2.4.20-30.9bigmem (for 8 GB
> of RAM)
> java version "1.4.2_04"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_04-b05)
> Java HotSpot(TM) Client VM (build 1.4.2_04-b05, mixed mode)
> 
> 3ware RAID with 12 disks of 160 GB.
> 
> THX,
> 
> Massimo
> 
> 
> Byron Miller wrote:
> 
> I'm up to 110 million URLs on a dual Xeon with 2
> gigs of memory, and while it just takes a while for
> analysis, it does complete without error. What
> OS/platform are you trying and what JVM do you use?
> 
> -byron
> 
> --- Massimo Miccoli <[EMAIL PROTECTED]> wrote:
> 
> Ciao,
> 
> First, my compliments for the Nutch code. My name
> is Massimo and I have followed the Nutch project from the
> first day. I have tested every new patched release (CVS).
> Now I want to try the NutchFs. I have many boxes and disks,
> and many problems with the webdb during link analysis when
> the db has about 40,000,000 URLs, even with a dual Xeon
> server and 8 GB of RAM. So is there a solution, by
> modifying the nutch bin file, to integrate the distributed
> version of webdb?
> 
> Many thanks,
> Massimo
>   
> -------------------------------------------------------
> This SF.Net email is sponsored by Sleepycat Software
> Learn developer strategies Cisco, Motorola, Ericsson & Lucent use to deliver
> higher performing products faster, at low TCO.
> http://www.sleepycat.com/telcomwpreg.php?From=osdnemail3
> _______________________________________________
> Nutch-general mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-general



