Dear Dennis,
Here are a few lines to better explain our current configuration: Nutch 0.9 is
running in an OpenVZ Virtual Private Server with 512 MB RAM allocated to it and
full CPU usage (a Pentium IV Xeon 2.8 GHz; the machine itself has 1 GB of
physical RAM). NUTCH_HEAPSIZE in the nutch shell script is set to 350 MB
instead of the original 1000 MB.
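
For reference, the line we changed in bin/nutch looks something like this (the
value is in MB; the script turns it into the JVM's -Xmx option):

  NUTCH_HEAPSIZE=350
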
Right now it has been running the generate step for approximately 3 hours for a
topN of 40000 pages; it usually takes between 4 and 5 hours. The top command
shows the following:
  PID USER   PR  NI  VIRT   RES   SHR  S  %CPU  %MEM     TIME+  COMMAND
16268 nutch  18   0  471m  159m  5808  S   100  31.1 171:03.82  java
So it looks like the CPU is the limiting factor in our case, if I interpret
this correctly. This is quite strange, because a Pentium IV 2.8 GHz is not that
old...
Regarding swapping: I can't see any swapping in the VPS, as there is really no
swap in OpenVZ. Here is the output of "free -m" during the generate step:
                    total   used   free   shared   buffers   cached
Mem:                  512    477     34        0         0        0
-/+ buffers/cache:            477     34
Swap:                   0      0      0
The Nutch binaries are stored on, and run from, a Linux MD software RAID 1 set
with two Seagate ATA 7200 rpm hard disks. The crawl directory and all its data
are stored on an 8 TB Linux RAID 6 NAS mounted via NFS with the following
mount options:
rw,tcp,rsize=8192,wsize=8192
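
For completeness, the corresponding mount entry would look roughly like this
(the NAS host and paths here are placeholders, not our real ones):

  nas:/export/crawl  /mnt/crawl  nfs  rw,tcp,rsize=8192,wsize=8192  0 0
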
But if I understand correctly, there is only very low I/O activity during
Nutch's generate step, because it is only reading the crawldb. Am I correct?
Regarding the URL filter, we use pretty much the default crawl-urlfilter.txt
file, except that we added maybe 2-3 more extensions to skip and changed the
"accept hosts" rule to only index hosts ending in .be. You will find our
crawl-urlfilter.txt file at the end of this mail.
I hope I have provided enough information; if you need more, just let me know.
Thanks again and best regards
---------- crawl-urlfilter.txt -------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept only hosts ending in .be
+^http://([a-z0-9]*\.).*\.be/
# skip everything else
-.
---------- crawl-urlfilter.txt -------
--- On Wed, 11/26/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> To: [email protected]
> Date: Wednesday, November 26, 2008, 4:46 AM
> Well, generate will still have to go through all of the urls, although
> skipping after 25 per domain should be really quick. It really depends on
> your hardware and any regexes you may be running in urlfilters for the
> generate step. A 2.8 GHz Xeon with 1 GB RAM should be pretty quick. 100,000
> pages on a core2duo on my laptop (4 GB RAM) takes less than an hour if I
> remember correctly. What type of hard drive (speed) are you using, and are
> you swapping a lot during generate? It may be the amount of RAM.
>
> Dennis
>
> ML mail wrote:
> > Dear Richard and others interested,
> >
> > Just wanted to post the results of reducing generate.max.per.host to 25
> > (instead of -1: unlimited) as recommended by Richard.
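> >
> > (For anyone who wants to try the same, the override goes in
> > conf/nutch-site.xml; a minimal sketch of the property block:
> >
> >   <property>
> >     <name>generate.max.per.host</name>
> >     <value>25</value>
> >   </property>
> > )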
> >
> > So actually, to summarize: the fetch step has been greatly reduced to
> > 1 hour instead of 6 hours (for topN set at 25000), but unfortunately the
> > generate step is still quite slow and takes around 4 hours (for the same
> > topN amount).
> >
> > Is it normal for the generate step to still be so slow? The whole index is
> > only around 170'000 pages big. Is there maybe also an option in the
> > nutch-default.xml config file where one can optimize the generate process?
> >
> > Best regards
> >
> >
> >
> >
> > --- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
> >
> >> From: Richard Cyganiak <[EMAIL PROTECTED]>
> >> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >> To: [email protected]
> >> Date: Friday, November 21, 2008, 2:42 AM
> >> On 21 Nov 2008, at 09:47, ML mail wrote:
> >>> What would then be the solution to this problem? Shall I simply set
> >>> generate.max.per.host to something like 5? Or is there another way to
> >>> make Nutch run at a good speed again?
> >>
> >> I spent some time trying different values for generate.max.per.host and
> >> I found that this is a good rule of thumb:
> >>
> >> generate.max.per.host = topN / numberOfNodes / 1000
> >>
> >> Where topN is the size of your segments and numberOfNodes is the number
> >> of machines in your cluster. This keeps the fetch rate close to maximum.
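> >> (For example, with topN = 25000 on a single machine, that gives
> >> 25000 / 1 / 1000 = 25.)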
> >>
> >> Check the log of the fetch job -- if the last few pages consist of
> >> requests to just one or a few hosts, then your value for
> >> generate.max.per.host is too large. You want to fetch from many hosts in
> >> parallel throughout the entire fetch job. On the other hand, if you set
> >> it too low, then you will never make progress on these large sites.
> >>
> >> I fetched the same segment repeatedly to find out what values work best.
> >>
> >> Hope that helps,
> >> Richard
> >>
> >>
> >>>
> >>> Best regards
> >>>
> >>>
> >>> --- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >>>> To: [email protected]
> >>>> Date: Thursday, November 20, 2008, 2:40 PM
> >>>> Off the top of my head, I would guess that you hit a patch of urls all
> >>>> from the same domain, and that would slow down fetching on a single
> >>>> host because only one thread would be active. The generate.max.per.host
> >>>> config variable can limit that.
> >>>>
> >>>> But that is just a guess. What job is it slowing down on? Yes, Nutch
> >>>> will take more time with more data, but that is too much of a
> >>>> difference.
> >>>>
> >>>> Dennis
> >>>>
> >>>> ML mail wrote:
> >>>>> Hello,
> >>>>>
> >>>>> I am currently using the recrawl script from the Nutch Wiki for
> >>>>> crawling all websites from a specific small top-level domain, and I
> >>>>> have configured the recrawl script to run with THREADS=50, DEPTH=4,
> >>>>> TOPN=25000, which means that each time I run the script 100'000 pages
> >>>>> will get crawled.
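> >>>>> (Those are the variables near the top of the wiki recrawl script; a
> >>>>> minimal sketch, assuming the usual meanings:
> >>>>>
> >>>>>   THREADS=50   # fetcher threads
> >>>>>   DEPTH=4      # generate/fetch/updatedb rounds per run
> >>>>>   TOPN=25000   # pages per round, so 4 x 25000 = 100'000 per run
> >>>>> )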
> >>>>> The first time I ran the script, it took 6 hours for the whole
> >>>>> process with mergesegs, invertlinks, index, merge and so on. The
> >>>>> second time it took just 3 hours more, so 9 hours, then the third time
> >>>>> 12 hours, but now the fourth time it is actually still running after
> >>>>> 22 hours and it's only at the 64'000th page to be crawled. It looks
> >>>>> like it is especially the fetch step and the index step that are
> >>>>> running much more slowly; the other steps look normal.
> >>>>> So is this actually normal behavior for Nutch? I would expect Nutch
> >>>>> to be a tiny bit slower each time due to updating an always-growing
> >>>>> database/index/segment, but never so much slower as I am currently
> >>>>> experiencing, especially since right now there are only 144'915 pages
> >>>>> indexed and the whole crawl directory with everything is only around
> >>>>> 2 GB big.
> >>>>> Nutch is running on a quite good Pentium 4 Xeon computer, 2.8 GHz
> >>>>> with 1 GB RAM, with not much else running on it. Also, I didn't
> >>>>> change much in the config of Nutch itself, so it's pretty much the
> >>>>> default.
> >>>>> Does anyone have an idea? I can provide more info if you desire;
> >>>>> just let me know what you need.
> >>>>>
> >>>>> Many thanks in advance and best regards