Thank you for your expertise; I will have to rethink the architecture based on your much appreciated help!
Regards

--- On Fri, 11/28/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:

> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> To: [email protected]
> Date: Friday, November 28, 2008, 11:35 AM
>
> ML mail wrote:
> > Dear Dennis,
> >
> > Thanks for pointing out the many sources of problems on our side. I just
> > have a few thoughts about the points you mentioned.
> >
> > 1) I was thinking NFS would be useful because then we don't need, for
> > example, to run the crawler on the same machine as the web frontend. The
> > NFS share is in our case simply mounted on the crawler machine and on the
> > Tomcat web server machine. Also, the Linux NFS server is dedicated to
> > just this task, with a Gbit ethernet connection, RAID 6, and fast Seagate
> > enterprise SATA hard disks totaling 8 TB of storage. So what would be the
> > best architecture for Nutch crawling? One huge, very good machine with
> > lots of CPU power, lots of memory and lots of hard disk space? And how do
> > you separate the web server frontend from the crawler?
>
> IMHO the best architecture for Nutch crawling is a Hadoop setup to
> distribute the load. If you are using only a single machine then I would
> definitely have the file system be local.
>
> > 2) I totally agree 512 MB isn't much; that's why I do small crawling
> > sessions with only 50 threads for the fetcher.
>
> Just FYI: the fetcher.threads.fetch config variable sets it per task. So
> if you have 10 tasks running on your one machine and fetcher.threads.fetch
> is set to 50, you would actually have 500 threads. But even so, you said
> your slowdown was in generating, not fetching.
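To make the per-task arithmetic concrete, here is roughly what that setting looks like when overridden in conf/nutch-site.xml (a sketch only; the value 50 mirrors the discussion above, so 10 concurrent tasks would give 500 fetcher threads in total):

---------- conf/nutch-site.xml (sketch) -------

<?xml version="1.0"?>
<!-- Properties placed here override the defaults in conf/nutch-default.xml. -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <!-- Fetcher threads PER TASK, not per machine: 10 tasks x 50 = 500. -->
    <value>50</value>
  </property>
</configuration>

---------- conf/nutch-site.xml (sketch) -------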
> > 3) Our filter is really pretty much the same as the default filter, so I
> > don't think it's the filter, unless the default filter is no good :-(
>
> Depends on the URLs being fetched. I have seen it loop out before. I
> don't think this is your problem though. Without actually seeing the
> system, I think the problem is 1) NFS 2) memory.
>
> Dennis
>
> > Regards
> >
> > --- On Thu, 11/27/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> >> From: Dennis Kubes <[EMAIL PROTECTED]>
> >> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> >> To: [email protected]
> >> Date: Thursday, November 27, 2008, 2:22 PM
> >>
> >> I think you have more than one problem going on here to create the
> >> situation you are facing.
> >>
> >> 1) The biggest problem I see is the NFS mount. The generate step is
> >> still going to write out intermediate output and is going to be doing
> >> lots of input reads. With NFS it is having to do lots of network
> >> roundtrips. Even when using DFS, data is downloaded to local disk first
> >> before being processed in MapReduce.
> >>
> >> 2) 512M is low. I don't think that is the biggest problem, but it could
> >> definitely be causing some slowdown because of the amount of data that
> >> can be read and processed at one time.
> >>
> >> 3) CPU is at 100%, and if it stays that way for a long time there may
> >> be other issues going on besides just processing data. These could be
> >> filter regex related or something else. I don't think this is the main
> >> problem though.
> >>
> >> Dennis
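Points 1 and 2 translate into two quick changes on a single-machine setup: keep the crawl directory on local disk while crawling, and raise the JVM heap via the NUTCH_HEAPSIZE variable that the bin/nutch script reads. A sketch only, using the one-shot crawl command with illustrative paths; the same idea applies to the recrawl script discussed further down the thread:

---------- local-disk crawl (sketch) -------

# Give the Nutch JVM more heap, in MB; bin/nutch picks this up.
export NUTCH_HEAPSIZE=1000

# Crawl against a local directory instead of the NFS mount...
bin/nutch crawl urls -dir /local/disk/crawl -threads 50 -depth 4 -topN 25000

# ...then publish the finished crawl to the NAS for the Tomcat frontend.
rsync -a /local/disk/crawl/ /mnt/nas/crawl/

---------- local-disk crawl (sketch) -------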
> >>
> >> ML mail wrote:
> >>> Dear Dennis,
> >>>
> >>> Here are a few lines to explain our current configuration better:
> >>> Nutch 0.9 is running in an OpenVZ Virtual Private Server with 512 MB
> >>> RAM allocated to it and full CPU usage (a Pentium IV Xeon 2.8 GHz; the
> >>> machine itself has 1 GB physical RAM). NUTCH_HEAPSIZE in the nutch
> >>> shell script is set to 350 MB instead of the original 1000 MB.
> >>>
> >>> Right now it has been running the generate step for approx. 3 hours
> >>> for a topN of 40000 pages, and usually it takes between 4 and 5 hours.
> >>> The command top shows the following:
> >>>
> >>>     PID  USER   PR  NI  VIRT  RES   SHR  S  %CPU  %MEM      TIME+  COMMAND
> >>>   16268  nutch  18   0  471m  159m  5808 S   100  31.1  171:03.82  java
> >>>
> >>> So it looks like the CPU is the limiting factor in our case, if I
> >>> interpret this correctly. This is quite strange because a Pentium IV
> >>> 2.8 GHz is not so old...
> >>>
> >>> Regarding swapping, I can't see any swapping in the VPS, as there is
> >>> really no swap in OpenVZ; here is the output of "free -m" during the
> >>> generate step:
> >>>
> >>>                       total   used   free   shared   buffers   cached
> >>>   Mem:                  512    477     34        0         0        0
> >>>   -/+ buffers/cache:           477     34
> >>>   Swap:                   0      0      0
> >>>
> >>> The Nutch binaries are stored on, and run from, a Linux/MD software
> >>> RAID 1 set with two Seagate ATA 7200 rpm hard disks. The crawl
> >>> directory and all its data are stored on an 8 TB Linux RAID 6 NAS
> >>> mounted via NFS with the following NFS mount options:
> >>>
> >>>   rw,tcp,rsize=8192,wsize=8192
> >>>
> >>> But if I understand correctly, during the generate step of Nutch there
> >>> is only very low I/O activity because it is only querying the crawldb;
> >>> am I correct?
> >>>
> >>> About the URL filter, we use pretty much the default
> >>> crawl-urlfilter.txt file, except that we added maybe 2-3 more
> >>> extensions to skip and changed the "accept hosts" rule to only index
> >>> those ending in .be. You will find our crawl-urlfilter.txt file at the
> >>> end of this mail.
> >>>
> >>> So I hope I have provided you enough information; if you need more,
> >>> just let me know.
> >>>
> >>> Thanks again and best regards
> >>>
> >>> ---------- crawl-urlfilter.txt -------
> >>>
> >>> # skip file:, ftp:, & mailto: urls
> >>> -^(file|ftp|mailto):
> >>>
> >>> # skip image and other suffixes we can't yet parse
> >>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
> >>>
> >>> # skip URLs containing certain characters as probable queries, etc.
> >>> -[?*!@=]
> >>>
> >>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >>> -.*(/.+?)/.*?\1/.*?\1/
> >>>
> >>> # accept hosts in MY.DOMAIN.NAME
> >>> +^http://([a-z0-9]*\.).*\.be/
> >>>
> >>> # skip everything else
> >>> -.
> >>>
> >>> ---------- crawl-urlfilter.txt -------
> >>>
> >>> --- On Wed, 11/26/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> >>>> To: [email protected]
> >>>> Date: Wednesday, November 26, 2008, 4:46 AM
> >>>>
> >>>> Well, generate will still have to go through all of the urls,
> >>>> although skipping after 25 per domain should be really quick. It
> >>>> really depends on your hardware and any regexes you may be running in
> >>>> urlfilters for the generate step. A 2.8 GHz Xeon with 1 GB RAM should
> >>>> be pretty quick. 100,000 pages on a Core 2 Duo on my laptop (4 GB
> >>>> RAM) takes less than an hour if I remember correctly. What type of
> >>>> hard drive (speed) are you using, and are you swapping a lot during
> >>>> generate? It may be the amount of RAM.
> >>>>
> >>>> Dennis
> >>>>
> >>>> ML mail wrote:
> >>>>> Dear Richard and others interested,
> >>>>>
> >>>>> Just wanted to post the results of reducing generate.max.per.host
> >>>>> to 25 (instead of -1: unlimited) as recommended by Richard.
> >>>>>
> >>>>> So, to summarize, the fetch step has been greatly reduced to 1 hour
> >>>>> instead of 6 hours (for topN set at 25000), but unfortunately the
> >>>>> generate step is still quite slow and takes around 4 hours (for the
> >>>>> same topN amount).
> >>>>>
> >>>>> Is it normal for the generate step to still be so slow? The whole
> >>>>> index is only around 170'000 pages big. Is there maybe also an
> >>>>> option in the nutch-default.xml config file where one can optimize
> >>>>> the generate process?
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>> --- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> From: Richard Cyganiak <[EMAIL PROTECTED]>
> >>>>>> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >>>>>> To: [email protected]
> >>>>>> Date: Friday, November 21, 2008, 2:42 AM
> >>>>>>
> >>>>>> On 21 Nov 2008, at 09:47, ML mail wrote:
> >>>>>>> What would then be the solution to this problem? Shall I simply
> >>>>>>> set generate.max.per.host to something like 5? Or is there
> >>>>>>> another way to make Nutch run at a good speed again?
> >>>>>>
> >>>>>> I spent some time trying different values for
> >>>>>> generate.max.per.host and I found that this is a good rule of
> >>>>>> thumb:
> >>>>>>
> >>>>>>   generate.max.per.host = topN / numberOfNodes / 1000
> >>>>>>
> >>>>>> where topN is the size of your segments and numberOfNodes is the
> >>>>>> number of machines in your cluster. This keeps the fetch rate
> >>>>>> close to maximum.
> >>>>>>
> >>>>>> Check the log of the fetch job -- if the last few pages consist of
> >>>>>> requests to just one or a few hosts, then your value for
> >>>>>> generate.max.per.host is too large. You want to fetch from many
> >>>>>> hosts in parallel throughout the entire fetch job. On the other
> >>>>>> hand, if you set it too low, then you will never make progress on
> >>>>>> these large sites. I fetched the same segment repeatedly to find
> >>>>>> out what values work best.
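Worked through for the setup in this thread (topN = 25000 on a single machine), the rule gives 25000 / 1 / 1000 = 25, which is exactly the value reported above. A sketch of the corresponding nutch-site.xml entry:

---------- generate.max.per.host (sketch) -------

<property>
  <name>generate.max.per.host</name>
  <!-- Rule of thumb: topN / numberOfNodes / 1000 = 25000 / 1 / 1000 = 25.
       The default of -1 means unlimited URLs per host per segment. -->
  <value>25</value>
</property>

---------- generate.max.per.host (sketch) -------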
> >>>>>>
> >>>>>> Hope that helps,
> >>>>>> Richard
> >>>>>>
> >>>>>>> Best regards
> >>>>>>>
> >>>>>>> --- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>>>>>
> >>>>>>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>>>>>> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >>>>>>>> To: [email protected]
> >>>>>>>> Date: Thursday, November 20, 2008, 2:40 PM
> >>>>>>>>
> >>>>>>>> Off the top of my head, I would guess that you hit a patch of
> >>>>>>>> urls all from the same domain, and that would slow down fetching
> >>>>>>>> on a single host because only one thread would be active. The
> >>>>>>>> generate.max.per.host config variable can limit that.
> >>>>>>>>
> >>>>>>>> But that is just a guess. What job is it slowing down on? Yes,
> >>>>>>>> Nutch will take more time with more data, but that is too much
> >>>>>>>> of a difference.
> >>>>>>>>
> >>>>>>>> Dennis
> >>>>>>>>
> >>>>>>>> ML mail wrote:
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> I am currently using the recrawl script from the Nutch Wiki for
> >>>>>>>>> crawling all websites from a specific small top-level domain
> >>>>>>>>> and have configured the recrawl script to run with THREADS=50,
> >>>>>>>>> DEPTH=4, TOPN=25000, which means that each time I run the
> >>>>>>>>> script 100'000 pages will get crawled.
> >>>>>>>>>
> >>>>>>>>> The first time I ran the script it took 6 hours for the whole
> >>>>>>>>> process with mergesegs, invertlinks, index, merge and so on.
> >>>>>>>>> The second time it took just 3 hours more, so 9 hours, then the
> >>>>>>>>> third time 12 hours, but now the fourth time it is actually
> >>>>>>>>> still running after 22 hours and it's only at the 64'000th page
> >>>>>>>>> to be crawled. It looks like it is especially the fetch step
> >>>>>>>>> and the index step which are running much more slowly; the
> >>>>>>>>> other steps look normal.
> >>>>>>>>>
> >>>>>>>>> So is this actually normal behavior for Nutch? I would expect
> >>>>>>>>> Nutch to be a tiny little bit slower each time due to updating
> >>>>>>>>> an always-growing database/index/segment, but never so much
> >>>>>>>>> slower as I am currently experiencing, especially when right
> >>>>>>>>> now there are only 144'915 pages indexed and the whole crawl
> >>>>>>>>> directory with everything is only around 2 GB big.
> >>>>>>>>>
> >>>>>>>>> Nutch is running on a quite good Pentium 4 Xeon computer at
> >>>>>>>>> 2.8 GHz with 1 GB RAM and not much else running on it; also I
> >>>>>>>>> didn't change much in the config of Nutch itself, so it's
> >>>>>>>>> pretty much default.
> >>>>>>>>>
> >>>>>>>>> Does anyone have an idea? I can provide more info if you
> >>>>>>>>> desire; just let me know what you need.
> >>>>>>>>>
> >>>>>>>>> Many thanks in advance and best regards
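For reference, the knobs mentioned above sit at the top of the Wiki recrawl script as plain shell variables; a sketch of just that block, using the values from this thread (the rest of the script is assumed unchanged):

---------- recrawl script variables (sketch) -------

# Values as reported in this thread; see also generate.max.per.host above.
THREADS=50    # fetcher threads passed to bin/nutch fetch
DEPTH=4       # generate/fetch/updatedb rounds per run
TOPN=25000    # pages per round, so up to DEPTH x TOPN = 100'000 pages per run

---------- recrawl script variables (sketch) -------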
