Julien, I tried to apply your patch because I was curious:

$ patch < NUTCH-762-MultiGenerator.patch
but this seems to drop the two java files into the root directory instead of:

  src/java/org/apache/nutch/crawl/URLPartitioner.java
  src/java/org/apache/nutch/crawl/MultiGenerator.java

If I copy the files to those locations, I get compile errors. I'm up to date on the svn trunk. Did I miss a step?

Jesse

int GetRandomNumber() {
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com

On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> FYI: there is an implementation of such a modified Generator in
> http://issues.apache.org/jira/browse/NUTCH-762
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/10/5 Andrzej Bialecki <a...@getopt.org>
>
> > Eric wrote:
> >
> >> My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can
> >> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLDs,
> >> then crawl the links generated from the TLDs in increments of 100K?
> >
> > Yes. Make sure that you have the "generate.update.db" property set to
> > true, and then generate 16 segments, each having 100k urls. After you
> > finish generating them, you can start fetching.
> >
> > Similarly, you can do the same for the next level, only you will have
> > to generate more segments.
> >
> > This could be done much more simply with a modified Generator that
> > outputs multiple segments from one job, but that's not implemented yet.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > Information Retrieval, Semantic Web, Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
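A likely cause of the path problem at the top of the thread: when invoked without a `-p` option, `patch` strips all leading directories from the file names in the diff headers and uses only the base names, which drops new files into the current directory. Passing `-p0` makes it honour the full relative paths. A minimal self-contained demonstration (the diff below is a stand-in, not the actual NUTCH-762-MultiGenerator.patch):

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p src/java/org/apache/nutch/crawl

# A minimal new-file unified diff whose header carries a relative path,
# mimicking the layout in the real patch. /dev/null as the old name
# signals file creation.
cat > demo.patch <<'EOF'
--- /dev/null
+++ src/java/org/apache/nutch/crawl/URLPartitioner.java
@@ -0,0 +1 @@
+public class URLPartitioner {}
EOF

# Without -p, the file would be created as ./URLPartitioner.java;
# with -p0 it lands under src/java/org/apache/nutch/crawl/.
patch -p0 < demo.patch
ls src/java/org/apache/nutch/crawl/URLPartitioner.java
```

So from the trunk root, `patch -p0 < NUTCH-762-MultiGenerator.patch` should place both files in their intended locations.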
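The incremental generation Andrzej describes can be sketched as a driver loop. The commands are echoed here as a dry run; the `crawl/crawldb` and `crawl/segments` paths are hypothetical, and `generate.update.db` must be set to true in your Nutch configuration for successive passes to pick disjoint batches:

```shell
# Dry run of the 16-segment generation step. With generate.update.db=true,
# each generate pass marks its selected URLs in the crawldb, so each of the
# 16 passes selects a fresh 100k batch. Remove the `echo` to run against a
# real Nutch checkout.
generate_cmds() {
  for i in $(seq 1 16); do
    echo "bin/nutch generate crawl/crawldb crawl/segments -topN 100000"
  done
}
generate_cmds
# Once all 16 segments exist, fetch each one, e.g.:
#   bin/nutch fetch crawl/segments/<segment-timestamp>
```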