This is because some of the websites you are fetching have an unusually large number of pages. Since Nutch partitions fetch lists by hostname, all of a site's pages get assigned to a single fetcher. The way to avoid this is to cap the number of pages generated per site through the generate.max.per.host configuration property. In production we have this set to 10.
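
As a rough sketch, the override in conf/nutch-site.xml could look something like the following (generate.max.per.host is the property from nutch-default.xml; 10 matches the value we use, and a negative value means no limit):

<property>
  <name>generate.max.per.host</name>
  <value>10</value>
  <description>Maximum number of URLs per host in a single fetchlist. -1 means no limit.</description>
</property>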

The downside is that for very large sites whose content you may want to fetch in full (e.g. Wikipedia), only the top 10 pages of that site will be fetched per fetch cycle.

Dennis

Karol Rybak wrote:
Hello, I've successfully set up a cluster of 3 machines under Hadoop. However, I have a problem. While fetching, Hadoop generates 6 jobs, but the number of pages in those jobs is not spread equally: I get 5 jobs with ~3,500 pages each and one with ~50,000. That's not good, as the 5 small jobs finish very quickly and afterwards only one machine is working while the others are waiting. Could this be a problem with my configuration? I've set the number of map jobs to 30, the number of reduce jobs to 6, and fetcher threads to 150, yet during the fetch I still get only 6 map jobs. Any help would be appreciated, thanks.
