I have an unusual requirement and I'm not sure whether Nutch can handle it. Any guidance would help a lot. I have to crawl many intranet sites: site1, site2, site3, ..., site9, site10. Our requirement is that at "depth 8" every possible page of site9 and site10 should be indexed, while for site1 through site8 we want to use "topN 1000". We are unable to achieve this because "topN 1000" applies to the whole crawl, so many pages of site9 and site10 are sacrificed during the topN selection.
Currently we are considering a solution like this: 1) Do a first crawl with site9 and site10 in the seed URLs, using "-depth 8" and no topN parameter, so there is no topN selection and all possible pages are indexed. 2) Do a second crawl with site1 through site8 in the seed URLs, using "-depth 8" and "-topN 1000". 3) Merge the crawldb, index, etc. of both crawls into one crawl folder. Is there a better way? Can the first and second steps be run simultaneously on one machine? Can the two steps be made to write their segments into the same crawl folder? Would Nutch clustering be of some help here? Any guidance whatsoever would be really helpful for us.
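For reference, here is a rough sketch of how the two-crawl-then-merge plan could look with the Nutch 1.x command line. The directory names (seeds-full, seeds-topn, crawl-full, crawl-topn, crawl-merged) are made up for illustration, and the merge subcommands (mergedb, mergesegs, mergelinkdb, merge) should be checked against the version of Nutch in use:

```shell
# Crawl 1: site9 and site10 at depth 8 with no topN cap,
# so every reachable page is fetched.
bin/nutch crawl seeds-full -dir crawl-full -depth 8

# Crawl 2: site1..site8 at depth 8, capped at 1000 URLs
# per generate/fetch cycle.
bin/nutch crawl seeds-topn -dir crawl-topn -depth 8 -topN 1000

# Merge the two crawldbs into a single crawldb.
bin/nutch mergedb crawl-merged/crawldb \
    crawl-full/crawldb crawl-topn/crawldb

# Merge the segments from both crawls.
bin/nutch mergesegs crawl-merged/segments \
    -dir crawl-full/segments -dir crawl-topn/segments

# Merge the linkdbs.
bin/nutch mergelinkdb crawl-merged/linkdb \
    crawl-full/linkdb crawl-topn/linkdb

# Merge the Lucene indexes.
bin/nutch merge crawl-merged/index \
    crawl-full/indexes crawl-topn/indexes
```

This would answer question 3 (producing one merged crawl folder); whether both crawls can safely run at the same time on one machine likely depends mainly on available memory and disk I/O, since the two crawls write to separate directories.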
