Hi, Mubey,

If you like, I have a site-based crawler, but I would charge a
licensing fee for it. Are you interested?

Jian
www.jiansnet.com
Quality search at an affordable price

On 11/30/07, Mubey N. <[EMAIL PROTECTED]> wrote:
>
> I have an unusual requirement and I don't know whether Nutch can
> handle it. Any guidance would help a lot. I have to crawl many
> intranet sites: site1, site2, site3, ..., site9, site10. Our
> requirement is that at "depth 8" all reachable pages of site9 and
> site10 should be indexed. For site1 through site8 we'll be using
> "topN 1000". But we are unable to achieve this because the
> "topN 1000" applies to the whole crawl, so many pages of site9
> and site10 are dropped during the topN 1000 selection.
>
> Currently we have thought of a solution like this ...
>
> 1) Do a first crawl with these settings - put site9 and site10 in the
> seed urls and crawl with "-depth 8" and no topN parameter, so there
> is no topN selection and all reachable pages are indexed.
>
> 2) Do another crawl with these settings - put site1, site2, ..., site8
> in the seed urls and crawl with "-depth 8" and "-topN 1000".
>
> 3) Merge the crawldb, index, etc. of the two crawls into a single
> crawl folder.
>
> Is there a better way? Can the first and second steps be run
> simultaneously on one machine? Can they be made to write their
> segments into the same crawl folder? Would Nutch clustering be of
> some help here? Any guidance whatsoever would be really helpful
> for us.
>
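For reference, the three-step workaround above can be sketched with the Nutch one-shot crawl and merge tools. This is only a sketch: the directory and seed-list names (seeds-full, seeds-capped, crawl-full, crawl-capped, crawl-merged) are placeholders, and the exact merge-tool arguments vary between Nutch versions, so check `bin/nutch` usage on your install before running anything.

```shell
# Step 1: exhaustive crawl of site9 and site10 - no -topN, so no pages are pruned
bin/nutch crawl seeds-full -dir crawl-full -depth 8

# Step 2: capped crawl of site1..site8, limited to the top 1000 urls per round
bin/nutch crawl seeds-capped -dir crawl-capped -depth 8 -topN 1000

# Step 3: merge the two crawls into one crawl folder
bin/nutch mergedb crawl-merged/crawldb crawl-full/crawldb crawl-capped/crawldb
bin/nutch mergesegs crawl-merged/segments -dir crawl-full/segments -dir crawl-capped/segments
bin/nutch merge crawl-merged/indexes crawl-full/indexes crawl-capped/indexes
```

Run this way, the two crawls write to separate folders, so nothing stops you from running steps 1 and 2 at the same time on one machine (subject to memory and bandwidth) and merging afterwards.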