Hi Jesse, no problem. Feel free to post your comments / bug fixes / suggestions on the JIRA NUTCH-762
Thanks

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/4 Jesse Hires <jhi...@gmail.com>

> My apologies. missed a patch option :-P
> Must need more coffee.
> Jesse
>
> int GetRandomNumber()
> {
>    return 4; // Chosen by fair roll of dice
>              // Guaranteed to be random
> } // xkcd.com
>
>
> On Tue, Nov 3, 2009 at 8:08 PM, Jesse Hires <jhi...@gmail.com> wrote:
>
> > Julien,
> > I tried to apply your patch because I was curious.
> > $ patch < NUTCH-762-MultiGenerator.patch
> >
> > but this seems to drop the two java files into the root directory instead of
> > src/java/org/apache/nutch/crawl/URLPartitioner.java
> > src/java/org/apache/nutch/crawl/MultiGenerator.java
> >
> > But if I copy the files to those locations, I get compile errors.
> > I'm up to date on the svn trunk.
> > Did I miss a step?
> >
> > Jesse
> >
> > int GetRandomNumber()
> > {
> >    return 4; // Chosen by fair roll of dice
> >              // Guaranteed to be random
> > } // xkcd.com
> >
> >
> > On Tue, Nov 3, 2009 at 7:09 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> >
> >> FYI : there is an implementation of such a modified Generator in
> >> http://issues.apache.org/jira/browse/NUTCH-762
> >>
> >> Julien
> >> --
> >> DigitalPebble Ltd
> >> http://www.digitalpebble.com
> >>
> >> 2009/10/5 Andrzej Bialecki <a...@getopt.org>
> >>
> >> > Eric wrote:
> >> >
> >> >> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
> >> >> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's then
> >> >> crawl the links generated from the TLD's in increments of 100K?
> >> >>
> >> >
> >> > Yes. Make sure that you have the "generate.update.db" property set to true,
> >> > and then generate 16 segments each having 100k urls. After you finish
> >> > generating them, then you can start fetching.
> >> >
> >> > Similarly, you can do the same for the next level, only you will have to
> >> > generate more segments.
> >> >
> >> > This could be done much simpler with a modified Generator that outputs
> >> > multiple segments from one job, but it's not implemented yet.
> >> >
> >> >
> >> > --
> >> > Best regards,
> >> > Andrzej Bialecki <><
> >> > ___. ___ ___ ___ _ _ __________________________________
> >> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> >> > ___|||__|| \| || | Embedded Unix, System Integration
> >> > http://www.sigram.com Contact: info at sigram dot com
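
For anyone following the thread without the patch: the manual approach Andrzej describes in the quoted message (set "generate.update.db" to true, then generate 16 segments of 100k urls each before fetching) could be scripted roughly as below. This is only a sketch under assumptions that are not from the thread: the bin/nutch location and the crawl/crawldb and crawl/segments paths are placeholders, and the property is assumed to be overridden in conf/nutch-site.xml. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch only: the paths below are hypothetical, adjust them to your install.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"

# generate_segments COUNT TOPN
# Runs (or, with DRY_RUN=1, just prints) one "generate" job per segment.
# Assumes generate.update.db is already set to true, so that each run
# marks its urls in the crawldb and the next run picks different ones.
generate_segments() {
  count=$1
  top_n=$2
  i=1
  while [ "$i" -le "$count" ]; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "$NUTCH generate $CRAWLDB $SEGMENTS -topN $top_n"
    else
      "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$top_n" || return 1
    fi
    i=$((i + 1))
  done
}

# ~1.6M urls in increments of 100k gives 16 segments for the first level.
generate_segments 16 100000
```

Fetching would then start only after all 16 segments have been generated, as the quoted reply says. The next depth level needs a larger segment count, which depends on how many links the first level produces.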