Hi Mubey,

If you like, I have a site-based crawler, but I would charge a licensing fee. Are you interested?
Jian
www.jiansnet.com
Quality search with affordable price

On 11/30/07, Mubey N. <[EMAIL PROTECTED]> wrote:
>
> I have a strange requirement which I don't know whether Nutch takes
> care of or not. Any guidance would help a lot. I have to crawl many
> intranet sites, including site1, site2, site3, ..., site9, site10.
> Our requirement is that for "depth 8" all possible pages of site9 and
> site10 should be indexed. For site1, site2, ..., site8 we'll be using
> "topN 1000". But we are unable to achieve this because the "topN 1000"
> applies to the whole crawl, and thus lots of pages of site9 and site10
> are sacrificed during the topN 1000 selection.
>
> Currently we have thought of a solution like this:
>
> 1) Do a first crawl with these settings - put site9 and site10 in the
> seed urls and crawl with "-depth 8" and no topN parameter. With no
> topN selection, all possible pages are indexed.
>
> 2) Do another crawl with these settings - put site1, site2, ..., site8
> in the seed urls and crawl with "-depth 8" and "-topN 1000".
>
> 3) Merge the crawldb, index, etc. of both crawls into one crawl folder.
>
> Is there a better way? Can the first and second steps be run
> simultaneously on one machine? Can they be made to write their
> segments into the same crawl folder? Would Nutch clustering be of
> some help here? Any guidance whatsoever would be really helpful
> for us.
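For what it's worth, the three-step plan quoted above could be scripted roughly as below. This is only a sketch: the seed-list names and crawl directories (seeds-deep, crawl-deep, etc.) are made up for illustration, and it assumes the 0.9-era Nutch command line (crawl, mergedb, mergesegs, merge) - check `bin/nutch` on your version for the exact usage.

```shell
#!/bin/sh
# Sketch of the two-crawl-then-merge plan (seed lists and directory
# names are hypothetical). NUTCH defaults to `echo bin/nutch` so the
# sketch can be dry-run without a Nutch install; set NUTCH=bin/nutch
# to actually execute it.
NUTCH="${NUTCH:-echo bin/nutch}"

# Step 1: exhaustive crawl of site9/site10 - depth 8, no -topN cap,
# so every reachable page gets fetched and indexed.
$NUTCH crawl seeds-deep -dir crawl-deep -depth 8

# Step 2: capped crawl of site1..site8 - depth 8, top 1000 URLs
# selected per generate round.
$NUTCH crawl seeds-capped -dir crawl-capped -depth 8 -topN 1000

# Step 3: merge the crawldbs, segments, and indexes of both crawls
# into a single crawl folder.
$NUTCH mergedb crawl-merged/crawldb crawl-deep/crawldb crawl-capped/crawldb
$NUTCH mergesegs crawl-merged/segments -dir crawl-deep/segments -dir crawl-capped/segments
$NUTCH merge crawl-merged/index crawl-deep/indexes crawl-capped/indexes
```

Because topN is applied per crawl, the deep sites never compete with the capped ones for the 1000 slots. Running the two crawls simultaneously on one machine should work as long as they use separate -dir output folders, as above.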
