or it is stuck on a couple of hosts which time out? The logs should have a trace with the number of active threads, which should give some indication of what's happening.
Julien

2009/11/25 Dennis Kubes <ku...@apache.org>

> If it is waiting and the box is idle, my first thought is not DNS. I just
> put that up as one of the things people will run into. Most likely it is
> uneven distribution of URLs or something like that.
>
> Dennis
>
>
> MilleBii wrote:
>
>> Get your point... Although I thought a high number of threads would do
>> exactly the same. Maybe I am missing something.
>>
>> During my fetcher runs used bandwidth gets low pretty quickly, disk
>> I/O is low, the CPU is low... So it must be waiting for something, but
>> what?
>>
>> Could be the DNS cache which is full, so any new request gets forwarded
>> to the master DNS of my ISP.
>> Any idea how to check that? I'm not familiar with Bind myself... What
>> is the typical rate you can get, how many DNS requests/s?
>>
>>
>>
>> 2009/11/25, Dennis Kubes <ku...@apache.org>:
>>
>>> It is not about the local DNS caching as much as having local DNS
>>> servers. Too many fetchers hitting a centralized DNS server can act as
>>> a DoS attack and slow down the entire fetching system.
>>>
>>> For example, say I have a single centralized DNS server for my network.
>>> And say I have 2 map tasks per machine, 50 machines, 20 threads per task.
>>> That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of
>>> 2000 DNS requests / sec. Most local DNS servers for smaller networks
>>> can't handle that. If everything is hitting a centralized DNS and that
>>> DNS takes 1-3 sec per request because of too many requests, the entire
>>> fetching system stalls.
>>>
>>> Hitting a secondary larger cache, such as OpenDNS, can have an effect
>>> because you are making one hop to get the name versus multiple hops to
>>> root servers then domain servers.
>>>
>>> Working off of a single server these issues don't show up as much
>>> because there aren't enough fetchers.
>>>
>>> Dennis Kubes
>>>
>>> MilleBii wrote:
>>>
>>>> Why would DNS local caching work... It only works if you are
>>>> going to crawl the same site often... In which case you are hit by
>>>> the politeness.
>>>>
>>>> If you have segments with only/mainly different sites it is not/really
>>>> going to help.
>>>>
>>>> So far I have not seen my quad core + 100mb/s + pseudo distributed
>>>> hadoop going faster than 10 fetch / s... Let me check the DNS and I
>>>> will tell you.
>>>>
>>>> I vote for 100 fetch/s, not sure how to get it though.
>>>>
>>>>
>>>>
>>>> 2009/11/24, Dennis Kubes <ku...@apache.org>:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> I just put this up on the wiki. Hope it helps:
>>>>>
>>>>> http://wiki.apache.org/nutch/OptimizingCrawls
>>>>>
>>>>> Dennis
>>>>>
>>>>>
>>>>> Mark Kerzner wrote:
>>>>>
>>>>>> Hi, guys,
>>>>>>
>>>>>> my goal is to do my crawls at 100 fetches per second, observing, of
>>>>>> course, polite crawling. But, when URLs are all different domains,
>>>>>> what theoretically would stop some software from downloading from
>>>>>> 100 domains at once, achieving the desired speed?
>>>>>>
>>>>>> But, whatever I do, I can't make Nutch crawl at that speed. Even if it
>>>>>> starts at a few dozen URLs/second, it slows down at the end (as
>>>>>> discussed by many and by Krugler).
>>>>>>
>>>>>> Should I write something of my own, or are there fast crawlers?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>

--
DigitalPebble Ltd
http://www.digitalpebble.com
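
For anyone following the thread, the fetcher arithmetic Dennis gives (50 machines * 2 map tasks * 20 threads) and MilleBii's question about how to check DNS can be sketched in a few lines of Python. This is only an illustrative sketch and not part of Nutch: the function names are made up for the example, the worst case assumes every fetcher thread issues one lookup per second, and the timing helper simply wraps the standard-library call socket.gethostbyname for whatever hosts you pass in.

```python
import socket
import time


def peak_dns_requests_per_sec(machines, tasks_per_machine, threads_per_task):
    """Worst case: every fetcher thread issues one DNS lookup per second."""
    return machines * tasks_per_machine * threads_per_task


def time_lookups(hosts, resolve=socket.gethostbyname, clock=time.perf_counter):
    """Return (host, seconds) pairs; seconds is None if the lookup failed.

    `resolve` and `clock` are injectable (an assumption made for this
    sketch) so the helper can be tried without touching the network.
    """
    results = []
    for host in hosts:
        start = clock()
        try:
            resolve(host)
            elapsed = clock() - start
        except OSError:  # socket.gaierror is a subclass of OSError
            elapsed = None
        results.append((host, elapsed))
    return results


# Numbers from the example in the thread: 50 machines, 2 tasks, 20 threads.
print(peak_dns_requests_per_sec(50, 2, 20))  # 2000
```

Calling time_lookups on a sample of hosts taken from a slow segment would show whether lookups routinely take on the order of the 1-3 sec mentioned above. If they come back in milliseconds, DNS is probably not the bottleneck and uneven URL distribution is the more likely suspect.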