Hi Jian,

Thank you for your reply. To be frank, we are not interested in any other crawler (free or low-cost) right now. I am sure you have a good crawler, but we will be sticking with Nutch for at least some time. We'll move to another crawler only if, after realising the full power of Nutch, we find it is not suitable for our job.
I think what we want to do is simple, since we can already achieve it with multiple crawls, as described in my previous mail (quoted below), and then merging the crawls. I am just wondering if there is a better way to do it. A rough command-line sketch of the merge approach follows the quote.

On Nov 30, 2007 1:59 PM, jian chen <[EMAIL PROTECTED]> wrote:
> If you so like, I have a crawler that is site-based, but I guess I will
> charge a fee for licensing. Are you interested?
>
> Jian
>
> On 11/30/07, Mubey N. <[EMAIL PROTECTED]> wrote:
> >
> > I have a strange requirement which I don't know whether Nutch can take
> > care of. Any guidance would help a lot. I have to crawl many intranet
> > sites: site1, site2, site3, ..., site9, site10. Our requirement is
> > that at "depth 8" all possible pages of site9 and site10 should be
> > indexed. For site1 through site8 we'll be using "topN 1000". But we
> > are unable to achieve this because the "topN 1000" applies to the
> > whole crawl, and so lots of pages of site9 and site10 are sacrificed
> > during the topN 1000 selection.
> >
> > Currently we have thought of a solution like this:
> >
> > 1) Do a first crawl with these settings - put site9 and site10 in the
> > seed urls and crawl with "-depth 8" and no topN parameter, so there
> > is no topN selection and all possible pages are indexed.
> >
> > 2) Do another crawl with these settings - put site1 through site8 in
> > the seed urls and crawl with "-depth 8" and "-topN 1000".
> >
> > 3) Merge the crawldb, index, etc. of both crawls into one crawl
> > folder.
> >
> > Is there a better way? Can the first and second steps be run
> > simultaneously on one machine? Can they be made to write their
> > segments into the same crawl folder? Would Nutch clustering be of
> > some help here? Any guidance whatsoever would be really helpful for
> > us.
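For reference, here is roughly how I picture the two crawls and the merge, using the Nutch command-line tools as I understand them. The seed directories seeds-deep and seeds-capped and the crawl-* folder names are just placeholders for our setup, and the merge invocations are from memory, so please treat this as a sketch and check the usage messages before running anything:

    # Crawl 1: site9 and site10 at full depth, no topN cap,
    # so every reachable page gets fetched and indexed.
    bin/nutch crawl seeds-deep -dir crawl-deep -depth 8

    # Crawl 2: site1 through site8, capped at 1000 URLs per round.
    bin/nutch crawl seeds-capped -dir crawl-capped -depth 8 -topN 1000

    # Merge the two crawldbs into a single crawldb.
    bin/nutch mergedb crawl-merged/crawldb crawl-deep/crawldb crawl-capped/crawldb

    # Merge the linkdbs the same way.
    bin/nutch mergelinkdb crawl-merged/linkdb crawl-deep/linkdb crawl-capped/linkdb

    # Merge all segments from both crawls into one segments folder.
    bin/nutch mergesegs crawl-merged/segments -dir crawl-deep/segments -dir crawl-capped/segments

    # Merge the indexes; this may need to point at the individual part
    # indexes under each crawl's indexes/ directory instead.
    bin/nutch merge crawl-merged/index crawl-deep/indexes crawl-capped/indexes

If this is roughly right, I would also guess the two crawls in steps 1 and 2 can run at the same time on one machine, since they write to separate -dir folders; only the merge steps need both crawls to have finished.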
