Gang,          I can see lots of discussion about fetching a large site like 
Wikipedia, but none of it gives a concrete picture of how to fetch without any 
problems. I have six fetcher jobs running, yet all URLs end up assigned to a 
single job: I believe Nutch partitions by hostname, so all of these pages get 
handed to one fetcher. If I use the generate.max.per.host parameter to restrict 
the number of URLs per job, will that distribute URLs uniformly across all 
jobs?
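For reference, here is a minimal sketch of what I mean in nutch-site.xml. The value 1000 is just an illustrative cap, and the exact property names depend on the Nutch version (newer releases replaced generate.max.per.host with generate.max.count plus generate.count.mode):

```xml
<!-- Sketch only: cap URLs generated per host per segment -->
<property>
  <name>generate.max.per.host</name>
  <value>1000</value>  <!-- hypothetical cap; -1 means no limit -->
</property>
<!-- Partitioning mode that sends all of one host's URLs to one fetcher -->
<property>
  <name>partition.url.mode</name>
  <value>byHost</value>  <!-- byHost | byDomain | byIP -->
</property>
```

My understanding is that the per-host cap only limits how many URLs enter a segment; with byHost partitioning, that single host's URLs still land on one fetcher task.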
As this is going to be a major issue, I am thinking of tweaking Nutch so that 
URLs would be assigned by count rather than by host, i.e. once a job reaches 
some number of URLs, the rest would be assigned to other jobs for fetching.
I am not sure which approach I should follow to successfully fetch a large 
site.
- RB