Actually, setting -noParsing helped, but only a bit: I now get about 6000 pages fetched per job (1000 earlier). I'll try using fetch instead of fetch2 and hope that helps. Another question: how do I control the number of fetch jobs? They do not behave like typical map jobs.

On 10/18/07, Karol Rybak <[EMAIL PROTECTED]> wrote:
>
> Well, that's not the case i have found out that those jobs have proper
> number of pages, however they end prematurely as fetcher fails with out of
> memory exception. Now i'm trying to fetch it without parsing, we'll see what
> happens...
>
> On 10/16/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> >
> > This is because some of the websites you are fetching have an unusually
> > large number of pages. Since Nutch partitions by hostname, all of these
> > pages get assigned to a single fetcher. The way to avoid this is to set
> > a maximum number of pages per site through the generate.max.per.host
> > configuration variable. In production we have this set to 10.
> >
> > The downside of this is that some very large sites which you may want to
> > fetch all of their content (i.e. wikipedia) still will only fetch the
> > top 10 pages of that site per fetch cycle.
> >
> > Dennis
> >
> > Karol Rybak wrote:
> > > Hello, i've succesfully set up cluster of 3 machines under hadoop.
> > > However i have a problem. While fetching hadoop generates 6 jobs,
> > > however the number of pages in each of those jobs is not spread
> > > equally i get 5 jobs with ~ 3 500 pages and one with ~ 50 000. That's
> > > not a good thing as 5 jobs finish very quickly and afterwards only one
> > > machine is working while others are waiting. Could this be a problem
> > > with my configuration, i've set number of map jobs to 30, number of
> > > reduce jobs to 6 and fetcher threads to 150, however during fetch i
> > > still get only 6 map jobs. Any help would be appreciated, thanks.
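
The skew Dennis describes in the quoted message can be reproduced with a toy model. The sketch below is not Nutch's partitioner: the function names, the example.* hostnames, the page counts, and the use of crc32 as the hash are all assumptions made for illustration. It only shows the general effect of assigning URLs to fetch lists by hostname, and how a per-host cap (the idea behind generate.max.per.host) removes the one oversized list.

```python
from collections import Counter
from urllib.parse import urlparse
import zlib


def partition_by_host(urls, num_fetchers):
    """Assign every URL to a fetch list based only on its hostname."""
    lists = [[] for _ in range(num_fetchers)]
    for url in urls:
        host = urlparse(url).netloc
        # crc32 is stable across runs (unlike the built-in hash()); it is an
        # illustrative choice, not the hash Nutch uses.
        index = zlib.crc32(host.encode("utf-8")) % num_fetchers
        lists[index].append(url)
    return lists


def cap_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs per hostname, in input order."""
    seen = Counter()
    kept = []
    for url in urls:
        host = urlparse(url).netloc
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept


# Hypothetical crawl list: one very large site plus 50 small ones.
urls = [f"http://big.example/page{i}" for i in range(50000)]
urls += [
    f"http://site{s}.example/page{i}" for s in range(50) for i in range(350)
]

uncapped = [len(part) for part in partition_by_host(urls, 6)]
capped = [len(part) for part in partition_by_host(cap_per_host(urls, 10), 6)]

print("pages per fetch list, no cap:   ", uncapped)
print("pages per fetch list, cap of 10:", capped)
```

Whichever list receives big.example ends up with at least 50000 pages, so that fetcher keeps running long after the others finish. With the cap, no host contributes more than 10 pages per cycle, at the cost Dennis mentions: large sites are fetched only a few pages at a time.
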

-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology and Management
+48(17)8661277
