From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
To: [email protected]
Date: Thursday, November 27, 2008, 2:22 PM
I think you have more than one problem going on here to
create the situation you are facing.
1) The biggest problem I see is the NFS mount. The
generate step still writes intermediate output and does a
lot of read input, so over NFS it has to make many network
roundtrips. Even when using DFS, data is downloaded to
local disk first before being processed in MapReduce.
2) 512M is low. I don't think that is the biggest
problem, but it could definitely be causing some slowdown
because of the amount of data that can be read and processed
at one time.
3) CPU is at 100%, and if it stays that way for a long time
there may be other issues going on besides just processing
data. These could be filter-regex related or something
else. I don't think this is the main problem, though.
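[Editor's note: if moving the whole crawl directory off NFS is not an option, one partial mitigation is to at least keep Hadoop's scratch space on a local disk, since that is where generate's intermediate output lands. The fragment below is a sketch for hadoop-site.xml (or nutch-site.xml) using the standard property names from the Hadoop bundled with Nutch 0.9; the /var/local paths are hypothetical and should point at a local, non-NFS disk.]

```xml
<!-- Sketch: keep MapReduce intermediate output off NFS.
     The /var/local/... paths are examples only. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/local/hadoop-tmp</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/var/local/mapred</value>
</property>
```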
Dennis
ML mail wrote:
Dear Dennis
Here are a few lines to better explain our current
configuration: Nutch 0.9 is running in an OpenVZ Virtual
Private Server with 512 MB RAM allocated to it and full CPU
usage (a Pentium IV Xeon 2.8 GHz; the machine itself has 1
GB physical RAM). NUTCH_HEAPSIZE in the nutch shell script
is set to 350 MB instead of the original 1000 MB.
Right now it has been running the generate step for
approximately 3 hours for a topN of 40000 pages; usually
this takes between 4 and 5 hours. The command top shows the
following:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
16268 nutch     18   0  471m 159m 5808 S  100 31.1 171:03.82 java
So it looks like the CPU is the limiting factor in our
case, if I interpret this correctly. This is quite strange,
because a Pentium IV 2.8 GHz is not so old...
Regarding swapping, I can't see any swapping in the
VPS, as there is no real swap in OpenVZ; here is the
output of "free -m" during the generate step:
             total       used       free     shared    buffers     cached
Mem:           512        477         34          0          0          0
-/+ buffers/cache:        477         34
Swap:            0          0          0
The Nutch binaries are stored on and run from a Linux MD
software RAID 1 set with two Seagate ATA 7200 rpm hard
disks. The crawl directory and all its data are stored on
an 8 TB Linux RAID 6 NAS mounted via NFS with the following
NFS mount options:
rw,tcp,rsize=8192,wsize=8192
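[Editor's note: 8 KB rsize/wsize is quite small for a bulk-I/O workload; if the NAS and network can handle it, larger transfer buffers usually help. A hypothetical /etc/fstab line is sketched below; the export path, mount point, and buffer sizes are examples to be tuned, not a tested recommendation for this setup.]

```
nas:/export/crawl  /mnt/crawl  nfs  rw,tcp,rsize=32768,wsize=32768,hard,intr  0 0
```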
But if I understand correctly, during the generate step
Nutch causes only very low I/O activity because it is only
querying the crawldb, am I correct?
About the URL filter: we use pretty much the default
crawl-urlfilter.txt file, except that we added maybe 2-3
more extensions to skip and changed the "accept hosts"
rule to only index hosts ending in .be. You will find our
crawl-urlfilter.txt file at the end of this mail.
I hope I have provided enough information; if you need
more, just let me know.
Thanks again and best regards
---------- crawl-urlfilter.txt -------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.).*\.be/
# skip everything else
-.
---------- crawl-urlfilter.txt -------
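[Editor's note: the loop-breaking rule in the file above is easy to sanity-check outside Nutch. The sketch below uses Python's re module, whose backreference semantics match Java's for this pattern, and hypothetical URLs; Nutch itself applies these rules with Java's regex engine.]

```python
import re

# The loop-breaking rule from crawl-urlfilter.txt: a slash-delimited
# segment (captured as group 1) that occurs 3+ times in the URL path.
LOOP = re.compile(r".*(/.+?)/.*?\1/.*?\1/")

# The "/x" segment repeats three times, so this URL would be skipped.
print(bool(LOOP.search("http://host/x/abc/x/abc/x/abc/")))  # -> True
# No repeated segment, so this URL passes the rule.
print(bool(LOOP.search("http://host/a/b/c/page.html")))     # -> False
```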
--- On Wed, 11/26/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
To: [email protected]
Date: Wednesday, November 26, 2008, 4:46 AM
Well, generate will still have to go through all of the
urls, although skipping after 25 per domain should be
really quick. It really depends on your hardware and any
regexes you may be running in urlfilters for the generate
step. A 2.8 GHz Xeon with 1 GB RAM should be pretty quick;
100,000 pages on the Core 2 Duo in my laptop (4 GB RAM)
takes less than an hour, if I remember correctly. What type
(speed) of hard drive are you using, and are you swapping a
lot during generate? It may be the amount of RAM.
Dennis
ML mail wrote:
Dear Richard and others interested,
Just wanted to post the results of reducing
generate.max.per.host to 25 (instead of -1: unlimited) as
recommended by Richard.
To summarize: the fetch step has been greatly reduced, to
1 hour instead of 6 (for a topN of 25000), but
unfortunately the generate step is still quite slow and
takes around 4 hours (for the same topN).
Is it normal for the generate step to still be so slow?
The whole index is only around 170'000 pages. Is there
maybe also an option in the nutch-default.xml config file
where one can optimize the generate process?
Best regards
--- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
From: Richard Cyganiak <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Friday, November 21, 2008, 2:42 AM
On 21 Nov 2008, at 09:47, ML mail wrote:
What would then be the solution to this problem? Shall I
simply set generate.max.per.host to something like 5? Or is
there another way to make Nutch run at a good speed again?

I spent some time trying different values for
generate.max.per.host and I found this to be a good rule
of thumb:

generate.max.per.host = topN / numberOfNodes / 1000

where topN is the size of your segments and numberOfNodes
is the number of machines in your cluster. This keeps the
fetch rate close to maximum.
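[Editor's note: applied to the numbers from earlier in this thread (topN=25000 on a single machine, which are taken from the messages above), the rule of thumb reproduces the value of 25 that was recommended. A minimal sketch:]

```python
# Richard's rule of thumb, written as integer arithmetic.
def max_per_host(top_n: int, number_of_nodes: int) -> int:
    return top_n // number_of_nodes // 1000

# topN=25000 on a single-node setup gives the 25 used in this thread.
print(max_per_host(25000, 1))  # -> 25
```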
Check the log of the fetch job -- if the last few pages
consist of requests to just one or a few hosts, then your
value for generate.max.per.host is too large. You want to
fetch from many hosts in parallel throughout the entire
fetch job. On the other hand, if you set it too low, then
you will never make progress on these large sites.
I fetched the same segment repeatedly to find out what
values work best.
Hope that helps,
Richard
Best regards
--- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Thursday, November 20, 2008, 2:40 PM
Off the top of my head, I would guess that you hit a patch
of urls all from the same domain, and that would slow down
fetching on a single host because only one thread would be
active. The generate.max.per.host config variable can limit
that. But that is just a guess. What job is it slowing down
on?
Yes, Nutch will take more time with more data, but that is
too much of a difference.
Dennis
ML mail wrote:
Hello,
I am currently using the recrawl script from the Nutch
Wiki to crawl all websites from a specific small top-level
domain, and I have configured the recrawl script to run
with THREADS=50, DEPTH=4, TOPN=25000, which means that each
time I run the script, 100'000 pages get crawled.
The first time I ran the script, it took 6 hours for the
whole process with mergesegs, invertlinks, index, merge and
so on. The second time it took 3 hours more, so 9 hours;
the third time 12 hours; but now the fourth time it is
actually still running after 22 hours and is only at the
64'000th page to be crawled. It looks like it is especially
the fetch step and the index step that are running much
more slowly; the other steps look normal.
Is this actually normal behavior for Nutch? I would expect
Nutch to be a tiny little bit slower each time, due to
updating an always-growing database/index/segment, but
never so much slower as I am currently experiencing.
Especially since right now there are only 144'915 pages
indexed and the whole crawl directory with everything is
only around 2 GB.
Nutch is running on a quite good Pentium 4 Xeon computer,
2.8 GHz with 1 GB RAM, and not much else running on it;
also, I didn't change much in the config of Nutch itself,
so it's pretty much default.
Does anyone have an idea? I can provide more info if you
desire; just let me know what you need.
Many thanks in advance and best regards