From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
To: [email protected]
Date: Thursday, November 27, 2008, 2:22 PM
I think you have more than one problem going on here to
create the situation you are facing.
1) The biggest problem I see is the NFS mount. The
generate step still writes intermediate output and does a
lot of read input, so over NFS it has to make many network
roundtrips. Even when using DFS, data is downloaded to
local disk first before being processed in MapReduce.
2) 512M is low. I don't think that is the biggest
problem, but it could definitely be causing some slowdown
because of the amount of data that can be read and processed
at one time.
3) CPU is at 100%, and if it stays that way for a long time
there may be other issues going on besides just processing
data. These could be filter-regex related or something
else. I don't think this is the main problem, though.
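[Editor's note: if moving the whole crawl directory off NFS is not an option, one partial mitigation is to at least keep Hadoop's scratch space on a local disk, since that is where generate's intermediate output lands. The fragment below is a sketch for hadoop-site.xml (or nutch-site.xml) using the standard property names from the Hadoop bundled with Nutch 0.9; the /var/local paths are hypothetical and should point at a local, non-NFS disk.]

```xml
<!-- Sketch: keep MapReduce intermediate output off NFS.
     The /var/local/... paths are examples only. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/local/hadoop-tmp</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/var/local/mapred</value>
</property>
```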
Dennis
ML mail wrote:
Dear Dennis
Here are a few lines to better explain our current
configuration: Nutch 0.9 is running in an OpenVZ Virtual
Private Server with 512 MB RAM allocated to it and full CPU
usage (a Pentium IV Xeon 2.8 GHz; the machine itself has 1
GB physical RAM). NUTCH_HEAPSIZE in the nutch shell script
is set to 350 MB instead of the original 1000 MB.
Right now it has been running the generate step for
approximately 3 hours for a topN of 40000 pages; usually
this takes between 4 and 5 hours. The command top shows the
following:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
16268 nutch     18   0  471m 159m 5808 S  100 31.1 171:03.82 java
So it looks like the CPU is the limiting factor in our
case, if I interpret this correctly. This is quite strange,
because a Pentium IV 2.8 GHz is not so old...
Regarding swapping, I can't see any swapping in the
VPS, as there is no real swap in OpenVZ; here is the
output of "free -m" during the generate step:
             total       used       free     shared    buffers     cached
Mem:           512        477         34          0          0          0
-/+ buffers/cache:        477         34
Swap:            0          0          0
The Nutch binaries are stored on and run from a Linux MD
software RAID 1 set with two Seagate ATA 7200 rpm hard
disks. The crawl directory and all its data are stored on
an 8 TB Linux RAID 6 NAS mounted via NFS with the following
NFS mount options:
rw,tcp,rsize=8192,wsize=8192
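[Editor's note: 8 KB rsize/wsize is quite small for a bulk-I/O workload; if the NAS and network can handle it, larger transfer buffers usually help. A hypothetical /etc/fstab line is sketched below; the export path, mount point, and buffer sizes are examples to be tuned, not a tested recommendation for this setup.]

```
nas:/export/crawl  /mnt/crawl  nfs  rw,tcp,rsize=32768,wsize=32768,hard,intr  0 0
```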
But if I understand correctly, during the generate step
Nutch causes only very low I/O activity because it is only
querying the crawldb, am I correct?
About the URL filter: we use pretty much the default
crawl-urlfilter.txt file, except that we added maybe 2-3
more extensions to skip and changed the "accept hosts"
rule to only index hosts ending in .be. You will find our
crawl-urlfilter.txt file at the end of this mail.
I hope I have provided enough information; if you need
more, just let me know.
Thanks again and best regards
---------- crawl-urlfilter.txt -------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.).*\.be/
# skip everything else
-.
---------- crawl-urlfilter.txt -------
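[Editor's note: the loop-breaking rule in the file above is easy to sanity-check outside Nutch. The sketch below uses Python's re module, whose backreference semantics match Java's for this pattern, and hypothetical URLs; Nutch itself applies these rules with Java's regex engine.]

```python
import re

# The loop-breaking rule from crawl-urlfilter.txt: a slash-delimited
# segment (captured as group 1) that occurs 3+ times in the URL path.
LOOP = re.compile(r".*(/.+?)/.*?\1/.*?\1/")

# The "/x" segment repeats three times, so this URL would be skipped.
print(bool(LOOP.search("http://host/x/abc/x/abc/x/abc/")))  # -> True
# No repeated segment, so this URL passes the rule.
print(bool(LOOP.search("http://host/a/b/c/page.html")))     # -> False
```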
--- On Wed, 11/26/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
To: [email protected]
Date: Wednesday, November 26, 2008, 4:46 AM
Well, generate will still have to go through all of the
urls, although skipping after 25 per domain should be
really quick. It really depends on your hardware and any
regexes you may be running in urlfilters for the generate
step. A 2.8 GHz Xeon with 1 GB RAM should be pretty quick;
100,000 pages on the Core 2 Duo in my laptop (4 GB RAM)
takes less than an hour, if I remember correctly. What type
(speed) of hard drive are you using, and are you swapping a
lot during generate? It may be the amount of RAM.
Dennis
ML mail wrote:
Dear Richard and others interested,
Just wanted to post the results of reducing
generate.max.per.host to 25 (instead of -1: unlimited) as
recommended by Richard.
To summarize: the fetch step has been greatly reduced, to
1 hour instead of 6 (for a topN of 25000), but
unfortunately the generate step is still quite slow and
takes around 4 hours (for the same topN).
Is it normal for the generate step to still be so slow?
The whole index is only around 170'000 pages. Is there
maybe also an option in the nutch-default.xml config file
where one can optimize the generate process?
Best regards
--- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
From: Richard Cyganiak <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Friday, November 21, 2008, 2:42 AM
On 21 Nov 2008, at 09:47, ML mail wrote:
What would then be the solution to this problem? Shall I
simply set generate.max.per.host to something like 5? Or is
there another way to make Nutch run at a good speed again?

I spent some time trying different values for
generate.max.per.host and I found this to be a good rule
of thumb:

generate.max.per.host = topN / numberOfNodes / 1000

where topN is the size of your segments and numberOfNodes
is the number of machines in your cluster. This keeps the
fetch rate close to maximum.
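[Editor's note: applied to the numbers from earlier in this thread (topN=25000 on a single machine, which are taken from the messages above), the rule of thumb reproduces the value of 25 that was recommended. A minimal sketch:]

```python
# Richard's rule of thumb, written as integer arithmetic.
def max_per_host(top_n: int, number_of_nodes: int) -> int:
    return top_n // number_of_nodes // 1000

# topN=25000 on a single-node setup gives the 25 used in this thread.
print(max_per_host(25000, 1))  # -> 25
```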
Check the log of the fetch job -- if the last few pages
consist of requests to just one or a few hosts, then your
value for generate.max.per.host is too large. You want to
fetch from many hosts in parallel throughout the entire
fetch job. On the other hand, if you set it too low, then
you will never make progress on these large sites.
I fetched the same segment repeatedly to find out what
values work best.
Hope that helps,
Richard
Best regards
--- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Thursday, November 20, 2008, 2:40 PM
Off the top of my head, I would guess that you hit a patch
of urls all from the same domain, and that would slow down
fetching on a single host because only one thread would be
active. The generate.max.per.host config variable can limit
that. But that is just a guess. What job is it slowing down
on?
Yes, Nutch will take more time with more data, but that is
too much of a difference.
Dennis
ML mail wrote:
Hello,
I am currently using the recrawl script from the Nutch
Wiki to crawl all websites from a specific small top-level
domain, and I have configured the recrawl script to run
with THREADS=50, DEPTH=4, TOPN=25000, which means that each
time I run the script, 100'000 pages get crawled.
The first time I ran the script, it took 6 hours for the
whole process with mergesegs, invertlinks, index, merge and
so on. The second time it took 3 hours more, so 9 hours;
the third time 12 hours; but now the fourth time it is
actually still running after 22 hours and is only at the
64'000th page to be crawled. It looks like it is especially
the fetch step and the index step that are running much
more slowly; the other steps look normal.
Is this actually normal behavior for Nutch? I would expect
Nutch to be a tiny little bit slower each time, due to
updating an always-growing database/index/segment, but
never so much slower as I am currently experiencing.
Especially since right now there are only 144'915 pages
indexed and the whole crawl directory with everything is
only around 2 GB.
Nutch is running on a quite good Pentium 4 Xeon computer,
2.8 GHz with 1 GB RAM, and not much else running on it;
also, I didn't change much in the config of Nutch itself,
so it's pretty much default.
Does anyone have an idea? I can provide more info if you
desire; just let me know what you need.
Many thanks in advance and best regards