Julien, I tried to apply your patch because I was curious:

$ patch < NUTCH-762-MultiGenerator.patch
but this seems to drop the two java files into the root directory instead of:

  src/java/org/apache/nutch/crawl/URLPartitioner.java
  src/java/org/apache/nutch/crawl/MultiGenerator.java

If I copy the files to those locations, I get compile errors. I'm up to date on the svn trunk. Did I miss a step?

Jesse

int GetRandomNumber() {
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com

On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> FYI: there is an implementation of such a modified Generator in
> http://issues.apache.org/jira/browse/NUTCH-762
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/10/5 Andrzej Bialecki <a...@getopt.org>
>
> > Eric wrote:
> >
> >> My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can
> >> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLDs,
> >> then crawl the links generated from the TLDs in increments of 100K?
> >
> > Yes. Make sure that you have the "generate.update.db" property set to
> > true, and then generate 16 segments, each having 100k urls. After you
> > finish generating them, you can start fetching.
> >
> > Similarly, you can do the same for the next level, only you will have
> > to generate more segments.
> >
> > This could be done much more simply with a modified Generator that
> > outputs multiple segments from one job, but that's not implemented yet.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > Information Retrieval, Semantic Web, Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
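A likely cause of the path problem at the top of the thread: when invoked without a `-p` option, `patch` strips all leading directories from the file names in the diff headers and uses only the base names, which drops new files into the current directory. Passing `-p0` makes it honour the full relative paths. A minimal self-contained demonstration (the diff below is a stand-in, not the actual NUTCH-762-MultiGenerator.patch):

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p src/java/org/apache/nutch/crawl

# A minimal new-file unified diff whose header carries a relative path,
# mimicking the layout in the real patch. /dev/null as the old name
# signals file creation.
cat > demo.patch <<'EOF'
--- /dev/null
+++ src/java/org/apache/nutch/crawl/URLPartitioner.java
@@ -0,0 +1 @@
+public class URLPartitioner {}
EOF

# Without -p, the file would be created as ./URLPartitioner.java;
# with -p0 it lands under src/java/org/apache/nutch/crawl/.
patch -p0 < demo.patch
ls src/java/org/apache/nutch/crawl/URLPartitioner.java
```

So from the trunk root, `patch -p0 < NUTCH-762-MultiGenerator.patch` should place both files in their intended locations.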
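The incremental generation Andrzej describes can be sketched as a driver loop. The commands are echoed here as a dry run; the `crawl/crawldb` and `crawl/segments` paths are hypothetical, and `generate.update.db` must be set to true in your Nutch configuration for successive passes to pick disjoint batches:

```shell
# Dry run of the 16-segment generation step. With generate.update.db=true,
# each generate pass marks its selected URLs in the crawldb, so each of the
# 16 passes selects a fresh 100k batch. Remove the `echo` to run against a
# real Nutch checkout.
generate_cmds() {
  for i in $(seq 1 16); do
    echo "bin/nutch generate crawl/crawldb crawl/segments -topN 100000"
  done
}
generate_cmds
# Once all 16 segments exist, fetch each one, e.g.:
#   bin/nutch fetch crawl/segments/<segment-timestamp>
```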