or it is stuck on a couple of hosts which time out? The logs should have a
trace with the number of active threads, which should give some indication
of what's happening.

Julien


2009/11/25 Dennis Kubes <ku...@apache.org>

> If it is waiting and the box is idle, my first thought is not DNS.  I just
> put that up as one of the things people will run into.  Most likely it is
> uneven distribution of URLs or something like that.
>
> Dennis
>
>
> MilleBii wrote:
>
>> Get your point... Although I thought a high number of threads would do
>> exactly the same. Maybe I am missing something.
>>
>> During my fetcher runs the used bandwidth gets low pretty quickly, disk
>> I/O is low, the CPU is low... So it must be waiting for something, but
>> what?
>>
>> It could be that the DNS cache is full and any new request gets forwarded
>> to the master DNS of my ISP.
>> Any idea how to check that? I'm not familiar with BIND myself... What
>> rate can you typically get, in DNS requests/s?
>>
>>
>>
>> 2009/11/25, Dennis Kubes <ku...@apache.org>:
>>
>>> It is not about the local DNS caching as much as having local DNS
>>> servers.  Too many fetchers hitting a centralized DNS server can act as
>>> a DoS attack and slow down the entire fetching system.
>>>
>>> For example, say I have a single centralized DNS server for my network.
>>> And say I have 2 map tasks per machine, 50 machines, 20 threads per task.
>>> That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of
>>> 2000 DNS requests / sec.  Most local DNS servers for smaller networks
>>> can't handle that.  If everything is hitting a centralized DNS and that
>>> DNS takes 1-3 sec per request because of too many requests, the entire
>>> fetching system stalls.
>>>
>>> Hitting a secondary larger cache, such as OpenDNS, can have an effect
>>> because you are making one hop to get the name versus multiple hops to
>>> root servers then domain servers.
>>>
>>> Working off of a single server these issues don't show up as much
>>> because there aren't enough fetchers.
>>>
>>> Dennis Kubes
>>>
>>> MilleBii wrote:
>>>
>>>> Why would local DNS caching work... It only works if you are
>>>> going to crawl the same site often... In which case you are hit by
>>>> the politeness rules.
>>>>
>>>> If you have segments with only/mainly different sites, it is not really
>>>> going to help.
>>>>
>>>> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
>>>> Hadoop going faster than 10 fetches/s... Let me check the DNS and I
>>>> will tell you.
>>>>
>>>> I vote for 100 fetches/s, not sure how to get it though.
>>>>
>>>>
>>>>
>>>> 2009/11/24, Dennis Kubes <ku...@apache.org>:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> I just put this up on the wiki.  Hope it helps:
>>>>>
>>>>> http://wiki.apache.org/nutch/OptimizingCrawls
>>>>>
>>>>> Dennis
>>>>>
>>>>>
>>>>> Mark Kerzner wrote:
>>>>>
>>>>>> Hi, guys,
>>>>>>
>>>>>> my goal is to do my crawls at 100 fetches per second, observing, of
>>>>>> course, polite crawling. But, when URLs are all from different domains,
>>>>>> what theoretically would stop some software from downloading from 100
>>>>>> domains at once, achieving the desired speed?
>>>>>>
>>>>>> But, whatever I do, I can't make Nutch crawl at that speed. Even if it
>>>>>> starts at a few dozen URLs/second, it slows down at the end (as
>>>>>> discussed by many and by Krugler).
>>>>>>
>>>>>> Should I write something of my own, or are there fast crawlers?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>
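
A rough way to approach the "how to check that?" question in the thread is
to time name lookups for a sample of hosts from the machine that runs the
fetcher, and to compare that with the worst-case number of fetcher threads
worked out above (50 * 2 * 20). The sketch below is only an illustration in
Python, not part of Nutch: the function names max_concurrent_fetchers and
time_lookups are made up for this example, and the resolver is passed in as
a parameter so a different one can be substituted.

```python
import socket
import time


def max_concurrent_fetchers(machines, tasks_per_machine, threads_per_task):
    """Upper bound on fetcher threads that could each issue a DNS lookup."""
    return machines * tasks_per_machine * threads_per_task


def time_lookups(hosts, resolve=socket.gethostbyname):
    """Time one lookup per host.

    Returns a dict mapping host -> elapsed seconds, or None if the
    lookup raised an OSError (socket.gaierror is a subclass of it).
    """
    timings = {}
    for host in hosts:
        start = time.perf_counter()
        try:
            resolve(host)
            timings[host] = time.perf_counter() - start
        except OSError:
            timings[host] = None
    return timings
```

Calling time_lookups on a few hundred hostnames taken from a segment and
looking at the slowest entries should show whether lookups are anywhere near
the 1-3 sec per request mentioned above. Names that the operating system or
a local resolver has already cached will come back quickly, so hosts that
have not been fetched recently give a more honest picture.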


-- 
DigitalPebble Ltd
http://www.digitalpebble.com
