Hi-
I don't have as large a list of URLs as you, but it is in the
millions. I also see really long generate times, about 3 hours. This is
definitely the largest part of my wait.
I have posted here before trying to figure this out. The thing is, I
can do a unix sort on a comparable list much more quickly, so I suspect
something is being done inefficiently. I don't fully know what is
happening inside Nutch, so I am not sure.
I suspect I could cut the time spent waiting for generate by generating
multiple segments at once, but I haven't spent much time getting this
working.
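To make the "multiple segments at once" idea concrete, here is a rough
sketch in Python. It is not Nutch code and does not reflect how generate
actually works internally; the function name, the (score, url) tuples and
the segment sizes are all made up for illustration. The point is only that
one sort followed by one pass can fill several segments, instead of paying
for the sort once per segment.

```python
from itertools import islice


def split_into_segments(ranked_urls, segment_size, num_segments):
    """Cut an already-ranked URL list into consecutive segments in one pass.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    it = iter(ranked_urls)
    segments = []
    for _ in range(num_segments):
        chunk = list(islice(it, segment_size))
        if not chunk:
            break  # ran out of URLs before filling every segment
        segments.append(chunk)
    return segments


# Made-up (score, url) candidates standing in for crawldb entries.
candidates = [
    (0.2, "http://example.com/c"),
    (0.9, "http://example.com/a"),
    (0.5, "http://example.com/b"),
    (0.1, "http://example.com/d"),
    (0.7, "http://example.com/e"),
]

# Sort once by score (highest first), then fill two segments of two URLs.
ranked = [url for score, url in sorted(candidates, reverse=True)]
segments = split_into_segments(ranked, segment_size=2, num_segments=2)
```

With the sample data above, the two highest-scoring URLs land in the first
segment and the next two in the second; the leftover URL would wait for a
later run.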
see you
-Jim
----- Original Message -----
From: "Karol Rybak" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, November 26, 2007 3:02 PM
Subject: Generate times
Hello, I have a crawldb consisting of about 57 million URLs, and I'm
generating segments of 1 million URLs each.
Generate takes 7 hrs 39 minutes to complete on my cluster.
I have 4 machines in my cluster; each is a Pentium 4 HT 3.0 GHz with 1 GB
RAM and 150 GB IDE drives.
Merging URLs into the crawldb took 4 hrs 34 mins last time.
What I wanted to ask is whether those times are normal for that kind of
configuration.
Is the generate phase that processor intensive? When I check, I have two
threads on each node, each taking up 100% of an HT pseudo-core.
I also have a problem with partitioning: 3 times out of 4 it fails because
it cannot open temporary files created by the generate job.
I'm using the trunk version of Nutch with Hadoop 0.15.
I need to find a way to speed up crawldb processing.
I want to create an updateable index of about 30 million pages which could
be updated every month.
I do not need the scoring-opic plugin, but I couldn't disable it. I do not
need it, as I'm using the index to search for plagiarism in our
university students' papers.
I was thinking about moving the whole crawldb into some database
(MySQL/PostgreSQL) and generating the URLs to crawl from there, then
importing them into Nutch using a clean crawldb and text files.
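A minimal sketch of that database idea, in Python with the standard
sqlite3 module standing in for MySQL/PostgreSQL. The table layout (url,
score, next_fetch), the index and the function name are assumptions made
up for illustration, not anything Nutch provides. It picks the
highest-scoring URLs that are due for fetching and builds the text that
would go into a URL file, one URL per line.

```python
import sqlite3


def select_urls_to_fetch(conn, now, limit):
    """Return up to `limit` due URLs, highest score first.

    Hypothetical helper; the schema is an assumption, not a Nutch format.
    """
    rows = conn.execute(
        "SELECT url FROM crawldb "
        "WHERE next_fetch <= ? "
        "ORDER BY score DESC, url "
        "LIMIT ?",
        (now, limit),
    )
    return [row[0] for row in rows]


# In-memory database with a made-up schema and sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawldb (url TEXT PRIMARY KEY, score REAL, next_fetch INTEGER)"
)
conn.execute("CREATE INDEX idx_next_fetch ON crawldb (next_fetch)")
conn.executemany(
    "INSERT INTO crawldb VALUES (?, ?, ?)",
    [
        ("http://example.com/a", 0.9, 50),
        ("http://example.com/b", 0.5, 150),  # not due yet at now=100
        ("http://example.com/c", 0.7, 80),
        ("http://example.com/d", 0.1, 10),
    ],
)

urls = select_urls_to_fetch(conn, now=100, limit=2)

# One URL per line, ready to be written to a text file for importing.
seed_text = "\n".join(urls) + "\n"
```

Whether this would actually be faster than generate at 57 million rows
depends on the database and its indexes; the sketch only shows the shape
of the selection step.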
Please let me know if you have any suggestions on how to speed up the
crawling process.
--
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania /
University of Internet Technology and Management
+48(17)8661277