Dear Dennis,

Thanks for pointing out the many sources of problems on our side. I just have 
a few thoughts about the points you mentioned.

1) I was thinking NFS would be useful because then we don't need, for example, 
to run the crawler on the same machine as the web frontend. In our case the 
NFS share is simply mounted on both the crawler machine and the Tomcat web 
server machine. Also, the Linux NFS server is dedicated to just this task, 
with a Gbit Ethernet connection, RAID 6, and fast Seagate enterprise SATA hard 
disks totaling 8 TB of storage. So what would be the best architecture for 
Nutch crawling? One huge, very good machine with lots of CPU power, lots of 
memory and lots of hard disk space? And how do you separate the web server 
front end from the crawler?
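
If we do keep NFS for now, I guess the first cheap thing I could try is 
bigger read/write buffers on the mount. Just as a rough sketch of what I have 
in mind (the server name, export path and mount point below are made up, and 
the 32K buffers are only a guess for a Gbit link):

    # unmount and remount the crawl share with larger NFS buffers
    # (hostname and paths are placeholders, not our real ones)
    umount /mnt/crawl
    mount -t nfs -o rw,tcp,rsize=32768,wsize=32768 nfs-server:/export/nutch /mnt/crawl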

2) I totally agree that 512 MB isn't much; that's why I do small crawling 
sessions with only 50 threads for the fetcher.
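
For reference, one of our small crawl rounds looks roughly like this (the 
paths are just examples, and NUTCH_HEAPSIZE is the value from our nutch 
script):

    # generate a small segment, fetch it with 50 threads, then update the crawldb
    export NUTCH_HEAPSIZE=350
    bin/nutch generate crawl/crawldb crawl/segments -topN 25000
    # segments are timestamp-named, so the last one sorted is the newest
    segment=crawl/segments/$(ls crawl/segments | tail -1)
    bin/nutch fetch $segment -threads 50
    bin/nutch updatedb crawl/crawldb $segment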

3) Our filter is really pretty much the same as the default filter. So I don't 
think it's the filter, unless the default filter itself is no good :-(
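
If you still suspect the regexes, I could time them against a dump of our 
URLs. A crude sketch of what I would try (urls.txt is hypothetical and the 
dump file name may differ; perl is used because the loop-breaking rule needs 
backreferences and lazy quantifiers, and its regex syntax is close enough to 
Java's for a rough test):

    # dump the crawldb, extract the URLs, then time the loop-breaking rule
    bin/nutch readdb crawl/crawldb -dump dump_dir
    grep '^http' dump_dir/part-00000 | cut -f1 > urls.txt
    time perl -ne 'print if m{(/.+?)/.*?\1/.*?\1/}' urls.txt > /dev/null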

Regards


--- On Thu, 11/27/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> To: [email protected]
> Date: Thursday, November 27, 2008, 2:22 PM
> I think you have more than one problem going on here to
> create the situation you are facing.
> 
> 1) The biggest problem I see is the NFS mount.  The
> generate step is still going to write out intermediate
> output and do lots of reading of input.  With NFS it is
> having to do lots of network round trips.  Even when using
> DFS, data is downloaded to local disk first before being
> processed in MapReduce.
> 
> 2) 512M is low.  I don't think that is the biggest
> problem but it could definitely be causing some slowdown
> because of the amount of data that can be read and processed
> at one time.
> 
> 3) CPU is at 100%, and if it stays that way for a long time
> there may be other issues going on besides just processing
> data.  These could be filter regex related or something
> else.  I don't think this is the main problem though.
> 
> Dennis
> 
> ML mail wrote:
> > Dear Dennis
> > 
> > Here are a few lines to better explain our current configuration: Nutch
> > 0.9 is running in an OpenVZ Virtual Private Server with 512 MB RAM
> > allocated to it and full CPU usage (a Pentium IV Xeon 2.8 GHz; the
> > machine itself has 1 GB physical RAM). NUTCH_HEAPSIZE in the nutch shell
> > script is set to 350 MB instead of the original 1000 MB.
> > 
> > Right now it has been running the generate step for approximately 3
> > hours for a topN of 40000 pages, and usually it takes between 4 and 5
> > hours. The command top shows the following:
> > 
> >   PID USER   PR  NI  VIRT  RES   SHR S %CPU %MEM     TIME+ COMMAND
> > 16268 nutch  18   0  471m  159m 5808 S  100 31.1 171:03.82 java
> > 
> > So it looks like the CPU is the limiting factor in our case, if I
> > interpret this correctly. This is quite strange because a Pentium IV
> > 2.8 GHz is not so old...
> > 
> > Regarding the swapping, I can't see any swapping in the VPS as there is
> > really no swap in OpenVZ. Here is the output of "free -m" during the
> > generate step:
> > 
> >              total    used    free   shared  buffers   cached
> > Mem:           512     477      34        0        0        0
> > -/+ buffers/cache:     477      34
> > Swap:            0       0       0
> > 
> > The Nutch binaries are stored on and run from a Linux MD software RAID 1
> > set with two Seagate ATA 7200 rpm hard disks. The crawl directory and
> > all its data are stored on an 8 TB Linux RAID 6 NAS mounted via NFS with
> > the following NFS mount options:
> > 
> > rw,tcp,rsize=8192,wsize=8192
> > 
> > But if I understand correctly, during the generate step of Nutch there
> > is only very low I/O activity because it is only querying the crawldb,
> > am I correct?
> > 
> > About the URL filter: we use pretty much the default crawl-urlfilter.txt
> > file, except that we added maybe 2-3 more extensions to skip and changed
> > the "accept hosts" rule to only index those ending in .be. You will find
> > our crawl-urlfilter.txt file at the end of this mail.
> > 
> > So I hope I have provided you enough information; if you need more, just
> > let me know.
> > 
> > Thanks again and best regards
> > 
> > 
> > ---------- crawl-urlfilter.txt -------
> > 
> > # skip file:, ftp:, & mailto: urls
> > -^(file|ftp|mailto):
> > 
> > # skip image and other suffixes we can't yet parse
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
> > 
> > # skip URLs containing certain characters as probable queries, etc.
> > -[?*!@=]
> > 
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > -.*(/.+?)/.*?\1/.*?\1/
> > 
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.).*\.be/
> > 
> > # skip everything else
> > -.
> > 
> > ---------- crawl-urlfilter.txt -------
> > 
> > 
> > 
> > 
> > --- On Wed, 11/26/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> > 
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> >> To: [email protected]
> >> Date: Wednesday, November 26, 2008, 4:46 AM
> >> Well, generate will still have to go through all of the urls, although
> >> skipping after 25 per domain should be really quick.  It really depends
> >> on your hardware and any regexes you may be running in urlfilters for
> >> the generate step.  A 2.8 GHz Xeon with 1 GB RAM should be pretty
> >> quick.  100,000 pages on a core2duo on my laptop (4 GB RAM) takes less
> >> than an hour if I remember correctly.  What type of hard drive (speed)
> >> are you using, and are you swapping a lot during generate?  It may be
> >> the amount of RAM.
> >> 
> >> Dennis
> >> 
> >> ML mail wrote:
> >>> Dear Richard and others interested,
> >>> 
> >>> Just wanted to post the results of reducing generate.max.per.host to
> >>> 25 (instead of -1: unlimited) as recommended by Richard.
> >>> 
> >>> So actually, to summarize: the fetch step has been greatly reduced to
> >>> 1 hour instead of 6 hours (for topN set at 25000), but unfortunately
> >>> the generate step is still quite slow and takes around 4 hours (for
> >>> the same topN amount).
> >>> 
> >>> Is it normal for the generate step to still be so slow? The whole
> >>> index is only around 170'000 pages big. Is there maybe also an option
> >>> in the nutch-default.xml config file where one can optimize the
> >>> generate process?
> >>> 
> >>> Best regards
> >>> 
> >>> --- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
> >>>> From: Richard Cyganiak <[EMAIL PROTECTED]>
> >>>> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >>>> To: [email protected]
> >>>> Date: Friday, November 21, 2008, 2:42 AM
> >>>> On 21 Nov 2008, at 09:47, ML mail wrote:
> >>>>> What would then be the solution to this problem? Shall I simply set
> >>>>> generate.max.per.host to something like 5? Or is there another way
> >>>>> to make Nutch run at a good speed again?
> >>>> 
> >>>> I spent some time trying different values for generate.max.per.host
> >>>> and I found that this is a good rule of thumb:
> >>>> 
> >>>> generate.max.per.host = topN / numberOfNodes / 1000
> >>>> 
> >>>> where topN is the size of your segments and numberOfNodes is the
> >>>> number of machines in your cluster. This keeps the fetch rate close
> >>>> to maximum.
> >>>> 
> >>>> Check the log of the fetch job -- if the last few pages consist of
> >>>> requests to just one or a few hosts, then your value for
> >>>> generate.max.per.host is too large. You want to fetch from many
> >>>> hosts in parallel throughout the entire fetch job. On the other
> >>>> hand, if you set it too low, then you will never make progress on
> >>>> these large sites.
> >>>> 
> >>>> I fetched the same segment repeatedly to find out what values work
> >>>> best.
> >>>> 
> >>>> Hope that helps,
> >>>> Richard
> >>>> 
> >>>> 
> >>>>> Best regards
> >>>>> 
> >>>>> --- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >>>>>> To: [email protected]
> >>>>>> Date: Thursday, November 20, 2008, 2:40 PM
> >>>>>> Off the top of my head I would guess that you hit a patch of urls
> >>>>>> all from the same domain, and that would slow down fetching on a
> >>>>>> single host because only one thread would be active?  The
> >>>>>> generate.max.per.host config variable can limit that.
> >>>>>> 
> >>>>>> But that is just a guess.  What job is it slowing down on?  Yes,
> >>>>>> Nutch will take more time with more data, but that is too much of
> >>>>>> a difference.
> >>>>>> 
> >>>>>> Dennis
> >>>>>> 
> >>>>>> ML mail wrote:
> >>>>>>> Hello,
> >>>>>>> 
> >>>>>>> I am currently using the recrawl script from the Nutch Wiki for
> >>>>>>> crawling all websites from a specific small top-level domain and
> >>>>>>> have configured the recrawl script to run with THREADS=50,
> >>>>>>> DEPTH=4, TOPN=25000. This means that each time I run the script,
> >>>>>>> 100'000 pages will get crawled.
> >>>>>>> 
> >>>>>>> The first time I ran the script it took 6 hours for the whole
> >>>>>>> process with mergesegs, invertlinks, index, merge and so on. The
> >>>>>>> second time it took just 3 hours more, so 9 hours, then the third
> >>>>>>> time 12 hours, but now the fourth time it is actually still
> >>>>>>> running after 22 hours and it's only at the 64'000th page to be
> >>>>>>> crawled. It looks like it is especially the fetch step and the
> >>>>>>> index step that are running much more slowly; the other steps
> >>>>>>> look normal.
> >>>>>>> 
> >>>>>>> So is this actually normal behavior for Nutch? I would expect
> >>>>>>> Nutch to be a tiny little bit slower each time due to updating an
> >>>>>>> always-growing database/index/segment, but never so much slower
> >>>>>>> as I am currently experiencing. Especially when right now there
> >>>>>>> are only 144'915 pages indexed and the whole crawl directory with
> >>>>>>> everything is only around 2 GB big.
> >>>>>>> 
> >>>>>>> Nutch is running on a quite good Pentium 4 Xeon computer, 2.8 GHz
> >>>>>>> with 1 GB RAM, with nothing much else running on it. Also I
> >>>>>>> didn't change much in the config of Nutch itself, so it's pretty
> >>>>>>> much default.
> >>>>>>> 
> >>>>>>> Does anyone have an idea? I can provide more info if you desire;
> >>>>>>> just let me know what you need.
> >>>>>>> 
> >>>>>>> Many thanks in advance and best regards