I face a similar problem. I occasionally have fetch jobs that fetch from
fewer than 100 hosts, and the effect is magnified in that case.

I have found one workaround, though I am not sure it is the best possible
solution: I set generate.max.per.host to a fairly small value (like 1000).
This caps the number of URLs taken from any one host in a fetchlist, which
reduces the maximum amount of time any task can be held up by a particular
host. It does increase the number of cycles needed to finish a crawl, but it
does solve the problem you mention. An even lower value might make sense; I
am still experimenting to find a good value myself.
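
To make the trade-off concrete, here is a rough back-of-the-envelope sketch
in Python. It is not Nutch code. It assumes requests to a single host are
issued one at a time, and the function names, the 5 seconds per page, and
the 50,000-URL host are all made-up numbers for illustration only.

```python
import math


def worst_case_host_seconds(max_urls_per_host: int, seconds_per_fetch: float) -> float:
    """Upper bound on the time one task spends on a single host in one cycle,
    assuming fetches to that host are serialized (hypothetical model)."""
    return max_urls_per_host * seconds_per_fetch


def cycles_needed(urls_on_host: int, max_urls_per_host: int) -> int:
    """Number of generate/fetch cycles needed to cover every URL on one host
    when each cycle takes at most max_urls_per_host URLs from it."""
    return math.ceil(urls_on_host / max_urls_per_host)


if __name__ == "__main__":
    urls_on_host = 50_000      # hypothetical size of one slow site
    seconds_per_fetch = 5.0    # hypothetical time per page, delay included

    for cap in (urls_on_host, 1000, 100):
        hours = worst_case_host_seconds(cap, seconds_per_fetch) / 3600
        cycles = cycles_needed(urls_on_host, cap)
        print(f"cap={cap:>6}: up to {hours:.2f} h on this host per cycle, {cycles} cycle(s)")
```

Under these assumptions a smaller cap shortens the time a single slow host
can hold up a task in one cycle, at the cost of proportionally more cycles
to cover that host.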

In addition, I think NUTCH-629 and NUTCH-570 could help reduce the effects
of the problem caused by slow servers.

Best,
Siddhartha Reddy

On Tue, Apr 22, 2008 at 1:46 AM, <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I am wondering how others deal with the following, which I see as fetching
> inefficiency:
>
>
> When fetching, the fetchlist is broken up into multiple parts and fetchers
> on cluster nodes start fetching.  Some fetchers end up fetching from fast
> servers, and some from very very slow servers.  Those fetching from slow
> servers take a long time to complete and prolong the whole fetching process.
>  For instance, I've seen tasks from the same fetch job finish in only 1-2
> hours, and others in 10 hours.  Those taking 10 hours were stuck fetching
> pages from a single or handful of slow sites.  If you have two nodes doing
> the fetching and one is stuck with a slow server, the other one is idling
> and wasting time.  The node stuck with the slow server is also
> underutilized, as it's slowly fetching from only 1 server instead of many.
>
> I imagine anyone using Nutch is seeing the same.  If not, what's the
> trick?
>
> I have not tried overlapping fetching jobs yet, but I have a feeling that
> won't help a ton, plus it could lead to two fetchers fetching from the same
> server and being impolite - am I wrong?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>


-- 
http://sids.in
"If you are not having fun, you are not doing it right."
