Siddhartha,

I think decreasing generate.max.per.host will limit the 'wait time' for each 
fetch run, but I have a feeling the overall time will be roughly the same. 
 As a matter of fact, it may even be higher, because you'll have to run 
generate more times, and if your fetch jobs are too short, you'll spend 
proportionally more time on MapReduce overhead (JVM instantiation, job 
initialization, and so on).
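For anyone following along, the property in question is set in conf/nutch-site.xml (overriding nutch-default.xml). A minimal sketch, assuming the 0.9/1.x property name and Siddhartha's example value of 1000:

```xml
<!-- Sketch of a nutch-site.xml override; the value 1000 is just the
     example from this thread, not a recommendation. -->
<property>
  <name>generate.max.per.host</name>
  <value>1000</value>
  <description>Maximum number of URLs per host in a single fetchlist.
  A value of -1 means no limit.</description>
</property>
```

Lowering it caps how long one slow host can pin a fetch task, at the cost of more generate/fetch cycles per crawl.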


Have you tried NUTCH-570?  I know it doesn't break anything, but I have not 
been able to see its positive effects - likely because my fetch cycles are 
dominated by those slow servers with lots of pages and not by wait time between 
subsequent requests to the same server.  But I'd love to hear if others found 
NUTCH-570 helpful!

Otis 
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Siddhartha Reddy <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, April 21, 2008 4:59:03 PM
> Subject: Re: Fetching inefficiency
> 
> I do face a similar problem. I occasionally have fetch jobs that are
> fetching from fewer than 100 hosts, and the effect is magnified in those cases.
> 
> I have found one workaround for this, though I am not sure it is the best
> possible solution: I set generate.max.per.host to a fairly small value
> (like 1000), which caps the amount of time any task can be held up by a
> particular host. This does increase the number of cycles needed to finish a
> crawl, but it does mitigate the problem. It might even make sense to use a
> lower value still -- I am still experimenting to find a good value myself.
> 
> In addition, I think NUTCH-629 and NUTCH-570 could help reduce the effects
> of the problem caused by slow servers.
> 
> Best,
> Siddhartha Reddy
> 
> On Tue, Apr 22, 2008 at 1:46 AM, wrote:
> 
> > Hello,
> >
> > I am wondering how others deal with the following, which I see as fetching
> > inefficiency:
> >
> >
> > When fetching, the fetchlist is broken up into multiple parts and fetchers
> > on cluster nodes start fetching.  Some fetchers end up fetching from fast
> > servers, and some from very very slow servers.  Those fetching from slow
> > servers take a long time to complete and prolong the whole fetching process.
> >  For instance, I've seen some tasks from the same fetch job finish in only
> > 1-2 hours while others took 10 hours.  Those taking 10 hours were stuck
> > fetching pages from a single slow site or a handful of them.  If you have two nodes doing
> > the fetching and one is stuck with a slow server, the other one is idling
> > and wasting time.  The node stuck with the slow server is also
> > underutilized, as it's slowly fetching from only 1 server instead of many.
> >
> > I imagine anyone using Nutch is seeing the same.  If not, what's the
> > trick?
> >
> > I have not tried overlapping fetching jobs yet, but I have a feeling that
> > won't help a ton, plus it could lead to two fetchers fetching from the same
> > server and being impolite - am I wrong?
> >
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> 
> 
> -- 
> http://sids.in
> "If you are not having fun, you are not doing it right."
