Well, that's not the case. I have found out that those jobs do have the proper number of pages; however, they end prematurely because the fetcher fails with an out-of-memory exception. Now I'm trying to fetch without parsing, we'll see what happens...
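Roughly what I'm changing, in case it helps anyone else. This is only a sketch: fetcher.parse is the standard Nutch switch for parsing during the fetch step, mapred.child.java.opts is the usual Hadoop knob for the child JVM heap, and the -Xmx value below is just an example, not something that came up in this thread.

  <!-- in conf/nutch-site.xml: skip parsing while fetching -->
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>

  <!-- in conf/hadoop-site.xml: give the map/reduce child JVMs more heap (value illustrative) -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>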
On 10/16/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> This is because some of the websites you are fetching have an unusually
> large number of pages. Since Nutch partitions by hostname, all of these
> pages get assigned to a single fetcher. The way to avoid this is to set
> a maximum number of pages per site through the generate.max.per.host
> configuration variable. In production we have this set to 10.
>
> The downside of this is that for some very large sites whose content you
> may want to fetch in full (e.g. Wikipedia), only the top 10 pages of that
> site will be fetched per fetch cycle.
>
> Dennis
>
> Karol Rybak wrote:
> > Hello, I've successfully set up a cluster of 3 machines under Hadoop.
> > However, I have a problem. While fetching, Hadoop generates 6 jobs, but
> > the number of pages in those jobs is not spread equally: I get 5 jobs
> > with ~3 500 pages and one with ~50 000. That's not good, as the 5 jobs
> > finish very quickly and afterwards only one machine is working while
> > the others are waiting. Could this be a problem with my configuration?
> > I've set the number of map jobs to 30, the number of reduce jobs to 6
> > and fetcher threads to 150, yet during the fetch I still get only
> > 6 map jobs. Any help would be appreciated, thanks.

--
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology and Management
+48(17)8661277
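(For reference, the per-host cap Dennis describes above would go into conf/nutch-site.xml; a minimal sketch, with the property name taken from the thread and the value of 10 matching what he says they use in production:

  <property>
    <name>generate.max.per.host</name>
    <value>10</value>
  </property>

A lower value spreads the generated fetch list more evenly across fetcher tasks, at the cost of covering large sites over many fetch cycles instead of one.)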
