Actually, setting -noParsing helped, but only a bit: I got about 6000 pages
fetched per job (1000 earlier). I'll try using fetch instead of fetch2 and
hope that this will help. Another question: how do I control the number of
fetch jobs, since they do not behave like typical map jobs?
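
For reference, here is a toy Python sketch of the effect Dennis describes in
the quoted message below: when URLs are partitioned by hostname, one very
large site ends up in a single partition, and a per-host cap evens things
out. This is not Nutch code. The function names, the example URLs and the
use of crc32 as the hash are all assumptions made for illustration only.

```python
from collections import Counter
from urllib.parse import urlparse
import zlib


def partition_by_host(urls, num_partitions):
    """Return partition sizes when each URL is assigned by hostname only."""
    sizes = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        # crc32 is a stand-in hash (assumption), chosen because it is
        # deterministic across runs; the real partitioner may differ.
        sizes[zlib.crc32(host.encode("utf-8")) % num_partitions] += 1
    return [sizes[i] for i in range(num_partitions)]


def cap_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs per hostname (toy per-host limit)."""
    seen = Counter()
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if seen[host] < max_per_host:
            seen[host] += 1
            kept.append(url)
    return kept


# Hypothetical fetch list: one very large site plus many small ones.
urls = [f"http://big.example.com/page{i}" for i in range(50000)]
urls += [f"http://site{h}.example.org/page{i}"
         for h in range(500) for i in range(35)]

print("uncapped:", partition_by_host(urls, 6))
print("capped:  ", partition_by_host(cap_per_host(urls, 10), 6))
```

In the uncapped run, all 50000 pages of the large host land in one
partition no matter how many partitions there are, which matches what I saw.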

On 10/18/07, Karol Rybak <[EMAIL PROTECTED]> wrote:
>
> Well, that's not the case. I have found out that those jobs have the proper
> number of pages; however, they end prematurely because the fetcher fails
> with an out-of-memory exception. Now I'm trying to fetch without parsing;
> we'll see what happens...
>
> On 10/16/07, Dennis Kubes <[EMAIL PROTECTED] > wrote:
> >
> > This is because some of the websites you are fetching have an unusually
> > large number of pages.  Since Nutch partitions by hostname, all of these
> > pages get assigned to a single fetcher.  The way to avoid this is to set
> > a maximum number of pages per site through the generate.max.per.host
> > configuration variable.  In production we have this set to 10.
> >
> > The downside of this is that for very large sites whose content you may
> > want to fetch in full (e.g. Wikipedia), only the top 10 pages of that
> > site are fetched per fetch cycle.
> >
> > Dennis
> >
> > Karol Rybak wrote:
> > > Hello, I've successfully set up a cluster of 3 machines under Hadoop.
> > > However, I have a problem. While fetching, Hadoop generates 6 jobs, but
> > > the pages are not spread equally among them: I get 5 jobs with ~3500
> > > pages each and one with ~50000. That's not a good thing, as 5 jobs finish
> > > very quickly and afterwards only one machine is working while the others
> > > wait. Could this be a problem with my configuration? I've set the number
> > > of map jobs to 30, the number of reduce jobs to 6 and fetcher threads to
> > > 150; however, during fetch I still get only 6 map jobs. Any help would be
> > > appreciated, thanks.
> > >
> >
>



-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology
and Management
+48(17)8661277
