FYI: there is an implementation of such a modified Generator in http://issues.apache.org/jira/browse/NUTCH-762

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/10/5 Andrzej Bialecki <a...@getopt.org>

> Eric wrote:
>
>> My plan is to crawl ~1.6M TLD's to a depth of 2. Is there a way I can
>> crawl it in increments of 100K? e.g. crawl 100K 16 times for the TLD's then
>> crawl the links generated from the TLD's in increments of 100K?
>
> Yes. Make sure that you have the "generate.update.db" property set to true,
> and then generate 16 segments each having 100k urls. After you finish
> generating them, then you can start fetching.
>
> Similarly, you can do the same for the next level, only you will have to
> generate more segments.
>
> This could be done much simpler with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
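
For anyone following the quoted advice without the patched Generator, here is a rough sketch of how the repeated generate step might be planned. It only builds and prints the command lines and does not run Nutch. The function name and the crawl/crawldb and crawl/segments paths are placeholders of mine, not anything from the thread. It assumes the stock "bin/nutch generate <crawldb> <segments_dir> -topN N" invocation, and that the "generate.update.db" property mentioned in the quoted reply is already set to true in your configuration so that successive runs do not pick the same URLs again.

```python
def plan_generate_commands(total_urls, batch_size,
                           crawldb="crawl/crawldb",
                           segments_dir="crawl/segments"):
    """Return one `bin/nutch generate` argument list per segment.

    The paths are hypothetical defaults; adjust them to your own layout.
    """
    if total_urls <= 0 or batch_size <= 0:
        raise ValueError("total_urls and batch_size must be positive")

    # Integer ceiling division: enough segments to cover every URL.
    n_segments = (total_urls + batch_size - 1) // batch_size

    # Every run is the same command; the crawldb update between runs
    # (assumed enabled via configuration) is what makes each segment differ.
    return [
        ["bin/nutch", "generate", crawldb, segments_dir,
         "-topN", str(batch_size)]
        for _ in range(n_segments)
    ]


if __name__ == "__main__":
    # ~1.6M URLs in increments of 100K, as in the original question.
    for cmd in plan_generate_commands(1_600_000, 100_000):
        print(" ".join(cmd))
```

Once all segments have been generated, fetching can start, as described in the quoted reply. The same planning applies to the next depth level, only with a larger total and therefore more segments.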