FYI: there is an implementation of such a modified Generator in
http://issues.apache.org/jira/browse/NUTCH-762

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/10/5 Andrzej Bialecki <a...@getopt.org>

> Eric wrote:
>
>> My plan is to crawl ~1.6M TLDs to a depth of 2. Is there a way I can
>> crawl them in increments of 100K? E.g., crawl 100K 16 times for the TLDs, then
>> crawl the links generated from the TLDs in increments of 100K?
>>
>
> Yes. Make sure that you have the "generate.update.db" property set to true,
> and then generate 16 segments each having 100k urls. After you finish
> generating them, then you can start fetching.
>
> Similarly, you can do the same for the next level, only you will have to
> generate more segments.
>
> This could be done much more simply with a modified Generator that outputs
> multiple segments from one job, but it's not implemented yet.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
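
For reference, the batching Andrzej describes in the quoted message (16 segments of 100k URLs each, all generated before fetching starts) could be sketched roughly as below. This is only an illustration, not part of Nutch: the helper name plan_generate_commands and the crawl/crawldb and crawl/segments paths are hypothetical, and the property name "generate.update.db" is taken from the message as written, so it is worth checking against the nutch-default.xml of your installed version. The script only builds the command lines; it does not run them.

```python
def plan_generate_commands(total_urls, batch_size,
                           crawldb="crawl/crawldb",        # hypothetical path
                           segments_dir="crawl/segments"):  # hypothetical path
    """Return one `bin/nutch generate` argv list per segment to be generated.

    Assumes the update property mentioned in the thread is set to true;
    that setting lives in the Nutch configuration, not in this script.
    """
    if total_urls <= 0 or batch_size <= 0:
        raise ValueError("total_urls and batch_size must be positive")
    # Integer ceiling division: a final partial batch still needs a segment.
    num_segments = (total_urls + batch_size - 1) // batch_size
    return [
        ["bin/nutch", "generate", crawldb, segments_dir,
         "-topN", str(batch_size)]
        for _ in range(num_segments)
    ]


if __name__ == "__main__":
    # ~1.6M URLs in increments of 100K, as in Eric's question.
    for argv in plan_generate_commands(1_600_000, 100_000):
        print(" ".join(argv))
```

Each command would be run one after the other, and fetching would start only once all segments exist, as described in the quoted message. The same helper applies to the next depth level with a larger total_urls.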
