This is because some of the websites you are fetching have an unusually large number of pages. Since Nutch partitions fetch lists by hostname, all of a site's pages get assigned to a single fetcher. The way to avoid this is to cap the number of pages generated per site through the generate.max.per.host configuration property. In production we have this set to 10.
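
As a rough sketch, the override in conf/nutch-site.xml could look something like the following (generate.max.per.host is the property from nutch-default.xml; 10 matches the value we use, and a negative value means no limit):

<property>
  <name>generate.max.per.host</name>
  <value>10</value>
  <description>Maximum number of URLs per host in a single fetchlist. -1 means no limit.</description>
</property>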

The downside is that for very large sites whose content you may want to fetch in full (e.g. Wikipedia), only the top 10 pages of that site will be fetched per fetch cycle.

Dennis

Karol Rybak wrote:
Hello, I've successfully set up a cluster of 3 machines under Hadoop. However, I have a problem. While fetching, Hadoop generates 6 jobs, but the number of pages in those jobs is not spread equally: I get 5 jobs with ~3,500 pages each and one with ~50,000. That's not good, as the 5 small jobs finish very quickly and afterwards only one machine is working while the others are waiting. Could this be a problem with my configuration? I've set the number of map jobs to 30, the number of reduce jobs to 6, and fetcher threads to 150, yet during the fetch I still get only 6 map jobs. Any help would be appreciated, thanks.
