I think you have more than one problem combining to create the situation you are facing.
1) The biggest problem I see is the NFS mount. The generate step still writes out intermediate output and does a lot of input reads, and over NFS each of those operations costs network round trips. Even when using DFS, data is downloaded to local disk before being processed in MapReduce.
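If the Hadoop working directories currently live on the NFS share, pointing them at the local RAID 1 disks should remove most of those round trips. A minimal sketch for conf/hadoop-site.xml -- the property names are standard Hadoop ones, but the paths are placeholders for wherever you have local space:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/local/disk/hadoop-tmp</value>
    <description>Base directory for temporary data; keep it on local disk, not NFS.</description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/local/disk/mapred-local</value>
    <description>Where MapReduce writes intermediate map output.</description>
  </property>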
2) 512M is low. I don't think that is the biggest problem, but it could definitely be causing some slowdown by limiting the amount of data that can be read and processed at one time.
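Rather than editing the script, you can also set the heap from the environment: bin/nutch reads NUTCH_HEAPSIZE as a value in MB. For example (the crawl paths are placeholders, and a bigger heap only helps once the container actually has the RAM to back it):

  # give the JVM a larger heap for this run, in MB
  export NUTCH_HEAPSIZE=700
  bin/nutch generate crawl/crawldb crawl/segments -topN 40000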
3) CPU is at 100%, and if it stays that way for a long time there may be other issues going on besides just processing data. These could be related to the filter regexes or something else. I don't think this is the main problem though.
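If you want to rule the regexes out, you can time the most expensive pattern -- the backreference-based loop detector -- against a sample of your URLs outside of Nutch. A minimal, self-contained sketch in plain Java (the sample URLs and iteration count are made up for illustration):

  import java.util.regex.Pattern;

  public class FilterRegexBench {
      public static void main(String[] args) {
          // The loop-detection rule from crawl-urlfilter.txt; the \1
          // backreferences can get expensive on long, repetitive URLs.
          Pattern loop = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

          // Hypothetical sample URLs; substitute a dump from your crawldb.
          String[] samples = {
              "http://www.example.be/a/b/c/page.html",
              "http://www.example.be/x/y/x/y/x/y/index.html"
          };

          long start = System.nanoTime();
          int hits = 0;
          for (int i = 0; i < 100000; i++) {
              for (String url : samples) {
                  if (loop.matcher(url).find()) {
                      hits++;
                  }
              }
          }
          long elapsedMs = (System.nanoTime() - start) / 1000000;
          System.out.println(hits + " matches in " + elapsedMs + " ms");
      }
  }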
Dennis
ML mail wrote:
Dear Dennis
Here are a few lines to explain our current configuration better: Nutch 0.9 is running in an OpenVZ Virtual Private Server with 512 MB RAM allocated to it and full CPU usage (a Pentium IV Xeon 2.8 GHz; the machine itself has 1 GB physical RAM). NUTCH_HEAPSIZE in the nutch shell script is set to 350 MB instead of the original 1000 MB.
Right now it has been running the generate step for approximately 3 hours for a topN of 40000 pages; usually it takes between 4 and 5 hours. The command top shows the following:
  PID USER  PR NI VIRT RES  SHR  S %CPU %MEM TIME+     COMMAND
16268 nutch 18  0 471m 159m 5808 S  100 31.1 171:03.82 java
So it looks like the CPU is the limiting factor in our case, if I interpret this correctly. This is quite strange, because a Pentium IV 2.8 GHz is not so old...
Regarding swapping, I can't see any in the VPS, as there is no real swap in OpenVZ; here is the output of "free -m" during the generate step:
                   total  used  free  shared  buffers  cached
Mem:                 512   477    34       0        0       0
-/+ buffers/cache:         477    34
Swap:                  0     0     0
The Nutch binaries are stored on, and run from, a Linux MD software RAID 1 set with two Seagate ATA 7200 rpm hard disks. The crawl directory and all its data are stored on an 8 TB Linux RAID 6 NAS mounted via NFS with the following NFS mount options:
rw,tcp,rsize=8192,wsize=8192
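As an aside, 8 KB rsize/wsize is on the small side for bulk sequential I/O. If the NAS and your kernel support it, larger buffers usually help; something along these lines (hard,intr are common companions, but check what your server accepts before changing anything):

  rw,tcp,hard,intr,rsize=32768,wsize=32768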
But if I understand correctly, during the generate step of Nutch there is only very low I/O activity because it is only querying the crawldb. Am I correct?
About the URL filter: we use pretty much the default crawl-urlfilter.txt file, except that we added maybe 2-3 more extensions to skip and changed the "accept hosts" rule to only index hosts ending in .be. You will find our crawl-urlfilter.txt file at the end of this mail.
So I hope I have provided you enough information, if you need more just let me
know.
Thanks again and best regards
---------- crawl-urlfilter.txt -------
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
# skip URLs containing certain characters as probable queries, etc.
[email protected]=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.).*\.be/
# skip everything else
-.
---------- crawl-urlfilter.txt -------
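PS: while pasting the file I noticed that, if the matching is done by the urlfilter-regex plugin (which uses java.util.regex and therefore supports inline flags), the upper/lower-case duplicates in the suffix rule could probably be collapsed with a case-insensitive group; to be verified against the plugin actually configured before relying on it:

  # skip image and other suffixes we can't yet parse (case-insensitive)
  -(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$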
--- On Wed, 11/26/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls (results)
To: [email protected]
Date: Wednesday, November 26, 2008, 4:46 AM
Well, generate will still have to go through all of the urls, although skipping after 25 per domain should be really quick. It really depends on your hardware and any regexes you may be running in urlfilters for the generate step. A 2.8Ghz Xeon with 1G ram should be pretty quick. 100,000 pages on a core2duo on my laptop (4G Ram) takes less than an hour if I remember correctly. What type of hard drive (speed) are you using and are you swapping a lot during generate? It may be the amount of RAM.
Dennis
ML mail wrote:
Dear Richard and others interested,
Just wanted to post the results of reducing generate.max.per.host to 25 (instead of -1: unlimited) as recommended by Richard.
To summarize: the fetch step has been greatly reduced, to 1 hour instead of 6 hours (for topN set at 25000), but unfortunately the generate step is still quite slow and takes around 4 hours (for the same topN amount).
Is it normal for the generate step to still be so slow? The whole index is only around 170'000 pages big. Is there maybe also an option in the nutch-default.xml config file where one can optimize the generate process?
Best regards
--- On Fri, 11/21/08, Richard Cyganiak <[EMAIL PROTECTED]> wrote:
From: Richard Cyganiak <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Friday, November 21, 2008, 2:42 AM
On 21 Nov 2008, at 09:47, ML mail wrote:
What would then be the solution to this problem? Shall I simply set generate.max.per.host to something like 5? Or is there another way to make Nutch run at a good speed again?
I spent some time trying different values for generate.max.per.host and I found that this is a good rule of thumb:
generate.max.per.host = topN / numberOfNodes / 1000
where topN is the size of your segments and numberOfNodes is the number of machines in your cluster. (For a single machine with topN = 25000, that works out to 25000 / 1 / 1000 = 25.) This keeps the fetch rate close to maximum.
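For reference, you can override the setting in conf/nutch-site.xml rather than editing nutch-default.xml; a sketch, using the value from the example above:

  <property>
    <name>generate.max.per.host</name>
    <value>25</value>
    <description>Maximum number of urls per host in a single fetchlist; -1 means unlimited.</description>
  </property>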
Check the log of the fetch job -- if the last few pages consist of requests to just one or a few hosts, then your value for generate.max.per.host is too large. You want to fetch from many hosts in parallel throughout the entire fetch job. On the other hand, if you set it too low, then you will never make progress on these large sites.
I fetched the same segment repeatedly to find out what values work best.
Hope that helps,
Richard
Best regards
--- On Thu, 11/20/08, Dennis Kubes <[EMAIL PROTECTED]> wrote:
From: Dennis Kubes <[EMAIL PROTECTED]>
Subject: Re: Nutch generate and fetch very slow after a few crawls
To: [email protected]
Date: Thursday, November 20, 2008, 2:40 PM
Off the top of my head I would guess that you hit a patch of urls all from the same domain and that would slow down fetching on a single host because only one thread would be active? The generate.max.per.host config variable can limit that.
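(Relatedly, the number of simultaneous fetch threads allowed on a single host is controlled by fetcher.threads.per.host, which defaults to 1. A sketch of overriding it in conf/nutch-site.xml, if you ever want more parallelism per host -- keeping in mind it makes the crawler less polite:

  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
    <description>Number of fetcher threads allowed to access one host at a time.</description>
  </property>)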
But that is just a guess. What job is it slowing down on? Yes Nutch will take more time with more data, but that is too much of a difference.
Dennis
ML mail wrote:
Hello,
I am currently using the recrawl script from the Nutch Wiki for crawling all websites from a specific small top level domain, and have configured the recrawl script to run with THREADS=50, DEPTH=4, TOPN=25000. This means that each time I run the script, 100'000 pages will get crawled.
The first time I ran the script it took 6 hours for the whole process with mergesegs, invertlinks, index, merge and so on. The second time it took 3 hours more, so 9 hours; the third time 12 hours; but now the fourth time it is actually still running after 22 hours and it's only at page 64'000. It looks like it is especially the fetch step and the index step which are running much more slowly; the other steps look normal.
So is this actually normal behavior for Nutch? I would expect Nutch to be a tiny little bit slower each time due to updating an always growing database/index/segment, but never so much slower as I am currently experiencing. Especially since right now there are only 144'915 pages indexed and the whole crawl directory with everything is only around 2 GB big.
Nutch is running on a quite good Pentium 4 Xeon computer, 2.8 GHz with 1 GB RAM and not much else running on it; also I didn't change much in the config of Nutch itself, so it's pretty much default.
Does anyone have an idea? I can provide more info if you desire, just let me know what you need.
Many thanks in advance and best regards