Hello, I have a crawldb consisting of about 57 million URLs, and I'm generating segments of 1 million URLs each.
Generate takes 7 hours 39 minutes to complete on my cluster. I have 4 machines in the cluster, each a Pentium 4 HT 3.0 GHz with 1 GB of RAM and a 150 GB IDE drive. Merging URLs into the crawldb took 4 hours 34 minutes last time.

What I wanted to ask is whether those times are normal for that kind of configuration. Is the generate phase really so processor-intensive? When I check, each node has two threads, each taking up 100% of an HT pseudo-core.

I also have a problem with partitioning: 3 times out of 4 it fails because it cannot open the temporary files created by the generate job. I'm using the trunk version of Nutch with Hadoop 0.15.

I need to find a way to speed up crawldb processing. I want to create an updatable index of about 30 million pages that could be refreshed every month. I do not need the scoring-opic plugin, but I couldn't disable it; scoring isn't necessary because I'm using the index to search for plagiarism in our university students' papers.

I was thinking about moving the whole crawldb into a database (MySQL/PostgreSQL), generating the URLs to crawl from there, and then importing them into Nutch using a clean crawldb and text files.

Please let me know if you have any suggestions on how to speed up the crawling process.

-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology and Management
+48(17)8661277
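
P.S. To make the last idea a bit more concrete, here is a rough sketch of the workflow I had in mind, assuming the standard inject and generate commands; the directory names and the -topN/-numFetchers values below are only placeholders, not my actual setup:

    # 1. Export the URLs selected for this crawl cycle from the external
    #    database into plain text files (one URL per line) under urls/.

    # 2. Inject them into a fresh, empty crawldb instead of the full 57M one:
    bin/nutch inject crawl_fresh/crawldb urls/

    # 3. Generate a segment from that small crawldb; -topN caps the segment
    #    size and -numFetchers matches the number of fetch tasks on the
    #    4-node cluster:
    bin/nutch generate crawl_fresh/crawldb crawl_fresh/segments -topN 1000000 -numFetchers 4

The point would be to avoid scanning the full 57-million-entry crawldb on every generate. Does that look reasonable?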
