I think you have more than one problem combining to create the situation you are facing.
1) The biggest problem I see is the NFS mount. The generate step still writes out intermediate output and does a lot of input reads, and over NFS each of those operations costs network round trips. Even when using DFS, data is downloaded to local disk before being processed in MapReduce.
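If the Hadoop working directories currently live on the NFS share, pointing them at the local RAID 1 disks should remove most of those round trips. A minimal sketch for conf/hadoop-site.xml -- the property names are standard Hadoop ones, but the paths are placeholders for wherever you have local space:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/local/disk/hadoop-tmp</value>
    <description>Base directory for temporary data; keep it on local disk, not NFS.</description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/local/disk/mapred-local</value>
    <description>Where MapReduce writes intermediate map output.</description>
  </property>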
2) 512M is low. I don't think that is the biggest problem, but it could definitely be causing some slowdown by limiting the amount of data that can be read and processed at one time.
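Rather than editing the script, you can also set the heap from the environment: bin/nutch reads NUTCH_HEAPSIZE as a value in MB. For example (the crawl paths are placeholders, and a bigger heap only helps once the container actually has the RAM to back it):

  # give the JVM a larger heap for this run, in MB
  export NUTCH_HEAPSIZE=700
  bin/nutch generate crawl/crawldb crawl/segments -topN 40000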
3) CPU is at 100%, and if it stays that way for a long time there may be other issues going on besides just processing data. These could be related to the filter regexes or something else. I don't think this is the main problem though.
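If you want to rule the regexes out, you can time the most expensive pattern -- the backreference-based loop detector -- against a sample of your URLs outside of Nutch. A minimal, self-contained sketch in plain Java (the sample URLs and iteration count are made up for illustration):

  import java.util.regex.Pattern;

  public class FilterRegexBench {
      public static void main(String[] args) {
          // The loop-detection rule from crawl-urlfilter.txt; the \1
          // backreferences can get expensive on long, repetitive URLs.
          Pattern loop = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

          // Hypothetical sample URLs; substitute a dump from your crawldb.
          String[] samples = {
              "http://www.example.be/a/b/c/page.html",
              "http://www.example.be/x/y/x/y/x/y/index.html"
          };

          long start = System.nanoTime();
          int hits = 0;
          for (int i = 0; i < 100000; i++) {
              for (String url : samples) {
                  if (loop.matcher(url).find()) {
                      hits++;
                  }
              }
          }
          long elapsedMs = (System.nanoTime() - start) / 1000000;
          System.out.println(hits + " matches in " + elapsedMs + " ms");
      }
  }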
Dennis
ML mail wrote:
Dear Dennis
Here are a few lines to explain our current configuration better: Nutch 0.9 is running in an OpenVZ Virtual Private Server with 512 MB RAM allocated to it and full CPU usage (a Pentium IV Xeon 2.8 GHz; the machine itself has 1 GB physical RAM). NUTCH_HEAPSIZE in the nutch shell script is set to 350 MB instead of the original 1000 MB.
Right now it has been running the generate step for approximately 3 hours for a topN of 40000 pages; usually it takes between 4 and 5 hours. The command top shows the following:
  PID USER  PR NI VIRT RES  SHR  S %CPU %MEM TIME+     COMMAND
16268 nutch 18  0 471m 159m 5808 S  100 31.1 171:03.82 java
So it looks like the CPU is the limiting factor in our case, if I interpret this correctly. This is quite strange, because a Pentium IV 2.8 GHz is not so old...
Regarding swapping, I can't see any in the VPS, as there is no real swap in OpenVZ; here is the output of "free -m" during the generate step:
                   total  used  free  shared  buffers  cached
Mem:                 512   477    34       0        0       0
-/+ buffers/cache:         477    34
Swap:                  0     0     0
The Nutch binaries are stored on, and run from, a Linux MD software RAID 1 set with two Seagate ATA 7200 rpm hard disks. The crawl directory and all its data are stored on an 8 TB Linux RAID 6 NAS mounted via NFS with the following NFS mount options:
rw,tcp,rsize=8192,wsize=8192
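As an aside, 8 KB rsize/wsize is on the small side for bulk sequential I/O. If the NAS and your kernel support it, larger buffers usually help; something along these lines (hard,intr are common companions, but check what your server accepts before changing anything):

  rw,tcp,hard,intr,rsize=32768,wsize=32768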
But if I understand correctly, during the generate step of Nutch there is only very low I/O activity because it is only querying the crawldb. Am I correct?
About the URL filter: we use pretty much the default crawl-urlfilter.txt file, except that we added maybe 2-3 more extensions to skip and changed the "accept hosts" rule to only index hosts ending in .be. You will find our crawl-urlfilter.txt file at the end of this mail.
So I hope I have provided you enough information, if you need more just let me
know.
Thanks again and best regards
---------- crawl-urlfilter.txt -------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
# skip URLs containing certain characters as probable queries, etc.
[email protected]=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.).*\.be/
# skip everything else
-.
---------- crawl-urlfilter.txt -------
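PS: while pasting the file I noticed that, if the matching is done by the urlfilter-regex plugin (which uses java.util.regex and therefore supports inline flags), the upper/lower-case duplicates in the suffix rule could probably be collapsed with a case-insensitive group; to be verified against the plugin actually configured before relying on it:

  # skip image and other suffixes we can't yet parse (case-insensitive)
  -(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$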
--- On Wed, 11/26/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
To: [email protected]
Date: Wednesday, November 26, 2008, 4:46 AM
Well, generate will still have to go through all of the urls, although skipping after 25 per domain should be really quick. It really depends on your hardware and any regexes you may be running in urlfilters for the generate step. A 2.8Ghz Xeon with 1G ram should be pretty quick. 100,000 pages on a core2duo on my laptop (4G Ram) takes less than an hour if I remember correctly. What type of hard drive (speed) are you using and are you swapping a lot during generate? It may be the amount of RAM.
Dennis
ML mail wrote:
Dear Richard and others interested,
Just wanted to post the results of reducing generate.max.per.host to 25 (instead of -1: unlimited) as recommended by Richard.
To summarize: the fetch step has been greatly reduced, to 1 hour instead of 6 hours (for topN set at 25000), but unfortunately the generate step is still quite slow and takes around 4 hours (for the same topN amount).
Is it normal for the generate step to still be so slow? The whole index is only around 170'000 pages big. Is there maybe also an option in the nutch-default.xml config file where one can optimize the generate process?
Best regards
--- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
From: Richard Cyganiak <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Friday, November 21, 2008, 2:42 AM
On 21 Nov 2008, at 09:47, ML mail wrote:
What would then be the solution to this problem? Shall I simply set generate.max.per.host to something like 5? Or is there another way to make Nutch run at a good speed again?
I spent some time trying different values for generate.max.per.host and I found that this is a good rule of thumb:
generate.max.per.host = topN / numberOfNodes / 1000
where topN is the size of your segments and numberOfNodes is the number of machines in your cluster. (For a single machine with topN = 25000, that works out to 25000 / 1 / 1000 = 25.) This keeps the fetch rate close to maximum.
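For reference, you can override the setting in conf/nutch-site.xml rather than editing nutch-default.xml; a sketch, using the value from the example above:

  <property>
    <name>generate.max.per.host</name>
    <value>25</value>
    <description>Maximum number of urls per host in a single fetchlist; -1 means unlimited.</description>
  </property>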
Check the log of the fetch job -- if the last few pages consist of requests to just one or a few hosts, then your value for generate.max.per.host is too large. You want to fetch from many hosts in parallel throughout the entire fetch job. On the other hand, if you set it too low, then you will never make progress on these large sites.
I fetched the same segment repeatedly to find out what values work best.
Hope that helps,
Richard
Best regards
--- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Thursday, November 20, 2008, 2:40 PM
Off the top of my head I would guess that you hit a patch of urls all from the same domain and that would slow down fetching on a single host because only one thread would be active? The generate.max.per.host config variable can limit that.
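(Relatedly, the number of simultaneous fetch threads allowed on a single host is controlled by fetcher.threads.per.host, which defaults to 1. A sketch of overriding it in conf/nutch-site.xml, if you ever want more parallelism per host -- keeping in mind it makes the crawler less polite:

  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
    <description>Number of fetcher threads allowed to access one host at a time.</description>
  </property>)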
But that is just a guess. What job is it slowing down on? Yes Nutch will take more time with more data, but that is too much of a difference.
Dennis
ML mail wrote:
Hello,
I am currently using the recrawl script from the Nutch Wiki for crawling all websites from a specific small top level domain, and have configured the recrawl script to run with THREADS=50, DEPTH=4, TOPN=25000. This means that each time I run the script, 100'000 pages will get crawled.
The first time I ran the script it took 6 hours for the whole process with mergesegs, invertlinks, index, merge and so on. The second time it took 3 hours more, so 9 hours; the third time 12 hours; but now the fourth time it is actually still running after 22 hours and it's only at page 64'000. It looks like it is especially the fetch step and the index step which are running much more slowly; the other steps look normal.
So is this actually normal behavior for Nutch? I would expect Nutch to be a tiny little bit slower each time due to updating an always growing database/index/segment, but never so much slower as I am currently experiencing. Especially since right now there are only 144'915 pages indexed and the whole crawl directory with everything is only around 2 GB big.
Nutch is running on a quite good Pentium 4 Xeon computer, 2.8 GHz with 1 GB RAM and not much else running on it; also I didn't change much in the config of Nutch itself, so it's pretty much default.
Does anyone have an idea? I can provide more info if you desire, just let me know what you need.
Many thanks in advance and best regards