Hi Mubey,
Did you try writing a custom Generator?
Take a closer look at the Generator class code. Your custom Generator's
map would emit a composite key combining the site (SiteN) with the sort
value, and its reduce would cap the number of URLs selected per site.
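The per-site capping that such a reduce would perform can be sketched in plain Java (this is an illustration of the idea only, not actual Nutch/Hadoop Generator code; the method name, the host-keyed limits map, and the crude host extraction are all made up for the example):

```java
import java.util.*;

public class PerHostTopN {
    /**
     * Keep at most limits.get(host) URLs per host, highest score first.
     * Hosts absent from the limits map keep all their URLs (no cap),
     * mimicking "no topN" for site9/site10 and "topN" for the rest.
     * Each input entry is {url, score-as-string}.
     */
    public static List<String> select(List<String[]> urlScores,
                                      Map<String, Integer> limits) {
        // Group entries by host (equivalent to keying the map output by site).
        Map<String, List<String[]>> byHost = new LinkedHashMap<>();
        for (String[] entry : urlScores) {
            String host = entry[0].split("/")[2]; // crude host extraction
            byHost.computeIfAbsent(host, k -> new ArrayList<>()).add(entry);
        }
        // For each host, sort by descending score and apply that host's cap.
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> e : byHost.entrySet()) {
            List<String[]> entries = e.getValue();
            entries.sort((a, b) -> Double.compare(
                    Double.parseDouble(b[1]), Double.parseDouble(a[1])));
            int limit = limits.getOrDefault(e.getKey(), Integer.MAX_VALUE);
            for (int i = 0; i < entries.size() && i < limit; i++) {
                selected.add(entries.get(i)[0]);
            }
        }
        return selected;
    }
}
```

In a real custom Generator the grouping would come for free from the map-side key, and the reduce would simply stop emitting once a site's count is reached.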



Mubey N. wrote:
I have a strange requirement which I don't know whether Nutch can
handle. Any guidance would help a lot. I have to crawl many intranet
sites: site1, site2, site3, ..., site9, site10. Our requirement is
that for "depth 8" all possible pages of site9 and site10 should be
indexed. For site1 through site8 we'll be using "topN 1000". But we
are unable to achieve this because the "topN 1000" applies to the
whole crawl, and thus many pages of site9 and site10 are sacrificed
during the topN 1000 selection.

Currently we have thought of a solution like this ...

1) Do first crawl with these settings - put site9 and site10 in seed
urls and do a crawl with "-depth 8" and no topN parameters. So there
is no topN selection and all possible pages are indexed.

2) Do another crawl with these settings - site1, site2, ... site8 in
seed urls and do a crawl with "-depth 8" and "-topN 1000" parameters.

3) Merge both the crawldb, index, etc. of both the crawls and create
one crawl folder.
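For what it's worth, the three steps could look roughly like this with the one-shot crawl command and the merge tools (the seed-list and directory names are invented, and the exact flags of mergedb/mergesegs/merge vary between Nutch versions, so check against your release):

```shell
# 1) Exhaustive crawl of site9/site10: depth 8, no topN cap.
bin/nutch crawl seeds-site9-10 -dir crawl-exhaustive -depth 8

# 2) Capped crawl of site1..site8: depth 8, topN 1000 per generate cycle.
bin/nutch crawl seeds-site1-8 -dir crawl-capped -depth 8 -topN 1000

# 3) Merge the crawldbs, segments, and indexes into one crawl folder.
bin/nutch mergedb crawl/crawldb crawl-exhaustive/crawldb crawl-capped/crawldb
bin/nutch mergesegs crawl/segments -dir crawl-exhaustive/segments -dir crawl-capped/segments
bin/nutch merge crawl/index crawl-exhaustive/index crawl-capped/index
```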

Is there a better way? Can the first step and second step be run
simultaneously on one machine? Can the first step and second step be
made to write their segments into the same crawl folder? Would Nutch
clustering be of some help here? Any guidance whatsoever would be
really helpful for us.


