Re: Generate times

misc Tue, 27 Nov 2007 14:16:41 -0800

Hi-

I don't have as large a list of URLs as you, but it is in the millions. I also see really long times for generate, about 3 hours. This is definitely the largest part of my wait.

I have posted here before trying to figure this out. The thing is, I can do a Unix sort on a comparable list much more quickly, so I suspect something is being done inefficiently. I don't fully know what is happening inside Nutch, so I am not sure.

I suspect I could cut the time spent waiting for generate by generating multiple segments at once, but I haven't spent much time getting this working.
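
To illustrate the "multiple segments at once" idea, here is a minimal Python sketch. It is not Nutch's generator; it assumes the crawldb can be viewed as a list of (url, score) pairs, and the function name split_into_segments is made up for the example. The point is only that the expensive selection runs once and the result is sliced into several segments, instead of one full pass per segment.

```python
import heapq

def split_into_segments(scored_urls, segment_size, num_segments):
    """Select the top-scoring URLs once, then slice them into segments.

    scored_urls: iterable of (url, score) pairs (an assumed, simplified
    stand-in for crawldb entries).
    """
    wanted = segment_size * num_segments
    # One selection pass over the whole list instead of one pass per segment.
    top = heapq.nlargest(wanted, scored_urls, key=lambda pair: pair[1])
    # Cut the selected URLs into consecutive chunks of segment_size.
    return [top[i:i + segment_size] for i in range(0, len(top), segment_size)]

# Toy data: ten URLs whose score equals their index.
urls = [("http://example.com/%d" % i, float(i)) for i in range(10)]
segments = split_into_segments(urls, segment_size=3, num_segments=2)
```

With a 1-million-URL segment size, generating four segments this way would cost roughly one selection pass rather than four.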


                       see you
                           -Jim

----- Original Message -----
From: "Karol Rybak" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, November 26, 2007 3:02 PM
Subject: Generate times

Hello, I have a crawldb consisting of about 57 million URLs, and I'm
generating segments of 1 million URLs each.

Generate takes 7 hrs 39 minutes to complete on my cluster.

I have 4 machines in my cluster; each is a Pentium 4 HT 3.0 GHz with 1 GB RAM
and 150 GB IDE drives.

Merging URLs into the crawldb took 4 hrs 34 mins last time.

What I wanted to ask is whether those times are normal for that kind of
configuration.

Is the generate phase really that processor intensive? When I check, I have
two threads on each node, each taking up 100% of an HT pseudo-core.

Also, I have a problem with partitioning: 3 times out of 4 it fails because
it cannot open temporary files created by the generate job.

I'm using the trunk version of Nutch with Hadoop 0.15.

I need to find a way to speed up crawldb processing.

I want to create an updateable index of about 30 million pages which could
be updated every month.

I do not need the scoring-opic plugin, but I couldn't disable it. I do not
need it because I'm using the index to search for plagiarism in our
university students' papers.

I was thinking about moving the whole crawldb into some database
(MySQL/Postgres) and generating the URLs to crawl from there, then importing
them into Nutch using a clean crawldb and text files.
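
A rough sketch of that database idea, under several assumptions: it uses
Python's built-in sqlite3 in place of MySQL/Postgres so that it runs
anywhere, and the table name (crawldb), the column names (url, next_fetch)
and the function name select_due_urls are all made up for illustration. It
only shows selecting the URLs that are due and formatting them as a plain
text list with one URL per line.

```python
import sqlite3

def select_due_urls(conn, now, limit):
    """Return up to `limit` URLs whose next fetch time has passed."""
    rows = conn.execute(
        "SELECT url FROM crawldb WHERE next_fetch <= ? "
        "ORDER BY next_fetch LIMIT ?",
        (now, limit),
    )
    return [row[0] for row in rows]

# In-memory database standing in for a real MySQL/Postgres server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawldb (url TEXT PRIMARY KEY, next_fetch INTEGER)")
# The index lets the database find due URLs without scanning every row.
conn.execute("CREATE INDEX idx_next_fetch ON crawldb (next_fetch)")
conn.executemany(
    "INSERT INTO crawldb VALUES (?, ?)",
    [
        ("http://example.com/a", 100),
        ("http://example.com/b", 300),
        ("http://example.com/c", 200),
    ],
)

due = select_due_urls(conn, now=250, limit=10)
# One URL per line, ready to be written to a text file for importing.
seed_text = "\n".join(due) + "\n"
```

Whether this would actually beat the generate job on 57 million URLs would
have to be measured; the sketch only shows the shape of the approach.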

Please let me know if you have any suggestions on how to speed up the
crawling process.

--
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section

Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology
and Management
+48(17)8661277
