Well, that's not the case. I have found out that those jobs do have the proper number of pages; however, they end prematurely because the fetcher fails with an out-of-memory exception. Now I'm trying to fetch without parsing, we'll see what happens...
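Roughly what I'm changing, in case it helps anyone else. This is only a sketch: fetcher.parse is the standard Nutch switch for parsing during the fetch step, mapred.child.java.opts is the usual Hadoop knob for the child JVM heap, and the -Xmx value below is just an example, not something that came up in this thread.

  <!-- in conf/nutch-site.xml: skip parsing while fetching -->
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>

  <!-- in conf/hadoop-site.xml: give the map/reduce child JVMs more heap (value illustrative) -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>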
On 10/16/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> This is because some of the websites you are fetching have an unusually
> large number of pages. Since Nutch partitions by hostname, all of these
> pages get assigned to a single fetcher. The way to avoid this is to set
> a maximum number of pages per site through the generate.max.per.host
> configuration variable. In production we have this set to 10.
>
> The downside of this is that for some very large sites whose content you
> may want to fetch in full (e.g. Wikipedia), only the top 10 pages of that
> site will be fetched per fetch cycle.
>
> Dennis
>
> Karol Rybak wrote:
> > Hello, I've successfully set up a cluster of 3 machines under Hadoop.
> > However, I have a problem. While fetching, Hadoop generates 6 jobs, but
> > the number of pages in those jobs is not spread equally: I get 5 jobs
> > with ~3 500 pages and one with ~50 000. That's not good, as the 5 jobs
> > finish very quickly and afterwards only one machine is working while
> > the others are waiting. Could this be a problem with my configuration?
> > I've set the number of map jobs to 30, the number of reduce jobs to 6
> > and fetcher threads to 150, yet during the fetch I still get only
> > 6 map jobs. Any help would be appreciated, thanks.

--
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology and Management
+48(17)8661277
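(For reference, the per-host cap Dennis describes above would go into conf/nutch-site.xml; a minimal sketch, with the property name taken from the thread and the value of 10 matching what he says they use in production:

  <property>
    <name>generate.max.per.host</name>
    <value>10</value>
  </property>

A lower value spreads the generated fetch list more evenly across fetcher tasks, at the cost of covering large sites over many fetch cycles instead of one.)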
