Dear Dennis,
Here are a few lines to better explain our current configuration: Nutch 0.9 is
running in an OpenVZ Virtual Private Server with 512 MB RAM allocated to it and
full CPU usage (a Pentium IV Xeon 2.8 GHz; the machine itself has 1 GB of
physical RAM). NUTCH_HEAPSIZE in the nutch shell script is set to 350 MB
instead of the original 1000 MB.
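
For reference, the line we changed in bin/nutch looks something like this (the
value is in MB; the script turns it into the JVM's -Xmx option):

  NUTCH_HEAPSIZE=350
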
Right now it has been running the generate step for approximately 3 hours for a
topN of 40000 pages; it usually takes between 4 and 5 hours. The top command
shows the following:
  PID USER   PR  NI  VIRT   RES   SHR  S  %CPU  %MEM     TIME+  COMMAND
16268 nutch  18   0  471m  159m  5808  S   100  31.1 171:03.82  java
So it looks like the CPU is the limiting factor in our case, if I interpret
this correctly. This is quite strange, because a Pentium IV 2.8 GHz is not that
old...
Regarding swapping: I can't see any swapping in the VPS, as there is really no
swap in OpenVZ. Here is the output of "free -m" during the generate step:
                    total   used   free   shared   buffers   cached
Mem:                  512    477     34        0         0        0
-/+ buffers/cache:            477     34
Swap:                   0      0      0
The Nutch binaries are stored on, and run from, a Linux MD software RAID 1 set
with two Seagate ATA 7200 rpm hard disks. The crawl directory and all its data
are stored on an 8 TB Linux RAID 6 NAS mounted via NFS with the following
mount options:
rw,tcp,rsize=8192,wsize=8192
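
For completeness, the corresponding mount entry would look roughly like this
(the NAS host and paths here are placeholders, not our real ones):

  nas:/export/crawl  /mnt/crawl  nfs  rw,tcp,rsize=8192,wsize=8192  0 0
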
But if I understand correctly, there is only very low I/O activity during
Nutch's generate step, because it is only reading the crawldb. Am I correct?
Regarding the URL filter, we use pretty much the default crawl-urlfilter.txt
file, except that we added maybe 2-3 more extensions to skip and changed the
"accept hosts" rule to only index hosts ending in .be. You will find our
crawl-urlfilter.txt file at the end of this mail.
I hope I have provided enough information; if you need more, just let me know.
Thanks again and best regards
---------- crawl-urlfilter.txt -------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept only hosts ending in .be
+^http://([a-z0-9]*\.).*\.be/
# skip everything else
-.
---------- crawl-urlfilter.txt -------
--- On Wed, 11/26/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> From: Dennis Kubes <[EMAIL PROTECTED]>
> Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
> To: [email protected]
> Date: Wednesday, November 26, 2008, 4:46 AM
> Well, generate will still have to go through all of the urls, although
> skipping after 25 per domain should be really quick. It really depends on
> your hardware and any regexes you may be running in urlfilters for the
> generate step. A 2.8 GHz Xeon with 1 GB RAM should be pretty quick. 100,000
> pages on a core2duo on my laptop (4 GB RAM) takes less than an hour if I
> remember correctly. What type of hard drive (speed) are you using, and are
> you swapping a lot during generate? It may be the amount of RAM.
>
> Dennis
>
> ML mail wrote:
> > Dear Richard and others interested,
> >
> > Just wanted to post the results of reducing generate.max.per.host to 25
> > (instead of -1: unlimited) as recommended by Richard.
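> >
> > (For anyone who wants to try the same, the override goes in
> > conf/nutch-site.xml; a minimal sketch of the property block:
> >
> >   <property>
> >     <name>generate.max.per.host</name>
> >     <value>25</value>
> >   </property>
> > )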
> >
> > So actually, to summarize: the fetch step has been greatly reduced to
> > 1 hour instead of 6 hours (for topN set at 25000), but unfortunately the
> > generate step is still quite slow and takes around 4 hours (for the same
> > topN amount).
> >
> > Is it normal for the generate step to still be so slow? The whole index is
> > only around 170'000 pages big. Is there maybe also an option in the
> > nutch-default.xml config file where one can optimize the generate process?
> >
> > Best regards
> >
> >
> >
> >
> > --- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
> >
> >> From: Richard Cyganiak <[EMAIL PROTECTED]>
> >> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >> To: [email protected]
> >> Date: Friday, November 21, 2008, 2:42 AM
> >> On 21 Nov 2008, at 09:47, ML mail wrote:
> >>> What would then be the solution to this problem? Shall I simply set
> >>> generate.max.per.host to something like 5? Or is there another way to
> >>> make Nutch run at a good speed again?
> >>
> >> I spent some time trying different values for generate.max.per.host and
> >> I found that this is a good rule of thumb:
> >>
> >> generate.max.per.host = topN / numberOfNodes / 1000
> >>
> >> Where topN is the size of your segments and numberOfNodes is the number
> >> of machines in your cluster. This keeps the fetch rate close to maximum.
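> >> (For example, with topN = 25000 on a single machine, that gives
> >> 25000 / 1 / 1000 = 25.)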
> >>
> >> Check the log of the fetch job -- if the last few pages consist of
> >> requests to just one or a few hosts, then your value for
> >> generate.max.per.host is too large. You want to fetch from many hosts in
> >> parallel throughout the entire fetch job. On the other hand, if you set
> >> it too low, then you will never make progress on these large sites.
> >>
> >> I fetched the same segment repeatedly to find out what values work best.
> >>
> >> Hope that helps,
> >> Richard
> >>
> >>
> >>>
> >>> Best regards
> >>>
> >>>
> >>> --- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >>>> From: Dennis Kubes <[EMAIL PROTECTED]>
> >>>> Subject: Re: Nutch generate and fetch very slow after a few crawls
> >>>> To: [email protected]
> >>>> Date: Thursday, November 20, 2008, 2:40 PM
> >>>> Off the top of my head, I would guess that you hit a patch of urls all
> >>>> from the same domain, and that would slow down fetching on a single
> >>>> host because only one thread would be active. The generate.max.per.host
> >>>> config variable can limit that.
> >>>>
> >>>> But that is just a guess. What job is it slowing down on? Yes, Nutch
> >>>> will take more time with more data, but that is too much of a
> >>>> difference.
> >>>>
> >>>> Dennis
> >>>>
> >>>> ML mail wrote:
> >>>>> Hello,
> >>>>>
> >>>>> I am currently using the recrawl script from the Nutch Wiki for
> >>>>> crawling all websites from a specific small top-level domain, and I
> >>>>> have configured the recrawl script to run with THREADS=50, DEPTH=4,
> >>>>> TOPN=25000, which means that each time I run the script 100'000 pages
> >>>>> will get crawled.
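> >>>>> (Those are the variables near the top of the wiki recrawl script; a
> >>>>> minimal sketch, assuming the usual meanings:
> >>>>>
> >>>>>   THREADS=50   # fetcher threads
> >>>>>   DEPTH=4      # generate/fetch/updatedb rounds per run
> >>>>>   TOPN=25000   # pages per round, so 4 x 25000 = 100'000 per run
> >>>>> )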
> >>>>> The first time I ran the script, it took 6 hours for the whole
> >>>>> process with mergesegs, invertlinks, index, merge and so on. The
> >>>>> second time it took just 3 hours more, so 9 hours, then the third time
> >>>>> 12 hours, but now the fourth time it is actually still running after
> >>>>> 22 hours and it's only at the 64'000th page to be crawled. It looks
> >>>>> like it is especially the fetch step and the index step that are
> >>>>> running much more slowly; the other steps look normal.
> >>>>> So is this actually normal behavior for Nutch? I would expect Nutch
> >>>>> to be a tiny bit slower each time due to updating an always-growing
> >>>>> database/index/segment, but never so much slower as I am currently
> >>>>> experiencing, especially since right now there are only 144'915 pages
> >>>>> indexed and the whole crawl directory with everything is only around
> >>>>> 2 GB big.
> >>>>> Nutch is running on a quite good Pentium 4 Xeon computer, 2.8 GHz
> >>>>> with 1 GB RAM, with not much else running on it. Also, I didn't
> >>>>> change much in the config of Nutch itself, so it's pretty much the
> >>>>> default.
> >>>>> Does anyone have an idea? I can provide more info if you desire;
> >>>>> just let me know what you need.
> >>>>>
> >>>>> Many thanks in advance and best regards