Hi,
Do you run generate with a filter? Depending on your filter settings, this
can make generate a lot slower.
If you do not need normalization (e.g. the URLs are already normalized), it
really helps to add this to nutch-site.xml:
<property>
  <name>urlnormalizer.scope.partition</name>
  <value>org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer</value>
</property>
<property>
  <name>urlnormalizer.scope.generate_host_count</name>
  <value>org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer</value>
</property>
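With those properties in place, generate runs as usual; as a rough example
(the paths and the -topN value are placeholders for your own setup):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000000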
I've just run the first part of generate on a 43-million-URL crawldb in 13.5
minutes on my 3-node cluster (3 tasks per node). Node hardware: 4 GB RAM,
~700 GB disk, 2.8 GHz quad-core.
Hope this helps!
Espen
On 11/27/07, misc <[EMAIL PROTECTED]> wrote:
>
>
> Hi-
>
> I don't have as large a list of URLs as you, but it is in the
> millions. I also see really long generate times, about 3 hours. This is
> definitely the longest part of my wait.
>
> I have posted here before trying to figure this out. The thing is, I
> can do a Unix sort on a comparable list much more quickly, so I suspect
> something is being done inefficiently. I don't fully know what is
> happening inside Nutch, though, so I can't be sure.
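> (By a Unix sort I mean a baseline along the lines of
> time sort -u urls.txt -o urls.sorted
> on a flat text file of URLs of comparable size.)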
>
> I suspect I could cut the generate wait down by generating multiple
> segments at once (see the sketch below), but I haven't spent much time
> getting this working.
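>
> Something like this untested sketch is what I have in mind (note that
> unless generated URLs are marked back into the crawldb, e.g. via a
> generate.update.crawldb-style option if your version has one, successive
> runs may keep selecting the same top-scoring URLs):
>
> for i in 1 2 3; do
>   bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
> done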
>
> see you
> -Jim
>
>
> ----- Original Message -----
> From: "Karol Rybak" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Monday, November 26, 2007 3:02 PM
> Subject: Generate times
>
>
> > Hello, I have a crawldb of about 57 million URLs, and I'm generating
> > segments of 1 million URLs each.
> >
> > Generate takes 7 hours 39 minutes to complete on my cluster.
> >
> > I have 4 machines in my cluster; each is a Pentium 4 HT 3.0 GHz with 1 GB
> > of RAM and a 150 GB IDE drive.
> >
> > Merging URLs into the crawldb took 4 hours 34 minutes last time.
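> >
> > (That merge is the updatedb step, i.e. roughly
> >
> > bin/nutch updatedb crawl/crawldb crawl/segments/SEGMENT_NAME
> >
> > where SEGMENT_NAME stands for whichever segment was fetched.)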
> >
> > What I wanted to ask is: are those times normal for this kind of
> > configuration?
> >
> > Is the generate phase really so processor-intensive? When I check, there
> > are two threads on each node, each taking up 100% of an HT pseudo-core.
> >
> > I also have a problem with partitioning: 3 times out of 4 it fails
> > because it cannot open the temporary files created by the generate job.
> >
> > I'm using the trunk version of Nutch with Hadoop 0.15.
> >
> > I need to find a way to speed up crawldb processing.
> >
> > I want to create an index of about 30 million pages that can be updated
> > every month.
> >
> > I do not need the scoring-opic plugin, but I couldn't disable it. I don't
> > need it because I'm using the index to search for plagiarism in our
> > university students' papers.
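> >
> > For context, scoring-opic gets loaded through the plugin.includes regex
> > in nutch-site.xml; a value without it, adjusted to whatever other plugins
> > are in use, would presumably look roughly like:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|urlnormalizer-(pass|regex|basic)</value>
> > </property>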
> >
> > I was thinking about moving the whole crawldb into a database
> > (MySQL/PostgreSQL), generating the URLs to crawl from there, and then
> > importing them back into Nutch using a clean crawldb and text files.
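> >
> > Roughly, with placeholder paths: dump the crawldb to text with
> >
> > bin/nutch readdb crawl/crawldb -dump crawldb_dump
> >
> > load that into the database, and later re-import selected URLs into a
> > fresh crawldb with
> >
> > bin/nutch inject crawl/crawldb urls/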
> >
> > Please let me know if you have any suggestions on how to speed up the
> > crawling process.
> >
> > --
> > Karol Rybak
> > Programista / Programmer
> > Sekcja aplikacji / Applications section
> > Wyższa Szkoła Informatyki i Zarządzania / University of Internet
> > Technology
> > and Management
> > +48(17)8661277
> >
>
>