I have to say that I'm still puzzled. Here is the latest: I just restarted a run and, guess what, I
got ultra-high speed: 8 Mbit/s sustained for 1 hour, where I could only get 3 Mbit/s max before (note: bits and not bytes, as I said before). A few samples show that I was running at 50 fetches/sec... not bad. But why I got this high speed on this run, I haven't got the faintest idea. Then it drops and I get this kind of log:

2009-11-25 23:28:28,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:29,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:29,584 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:30,227 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:30,585 INFO fetcher.Fetcher - -activeThreads=100, spinWaiting=100, fetchQueues.totalSize=516

I don't fully understand why it is oscillating between two queue sizes, but never mind... it is likely the end of the run, since Hadoop shows 99.99% complete for the 2 maps it generated. Would that be explained by a better URL mix?

2009/11/25 Mark Kerzner <markkerz...@gmail.com>

> Judging by how this discussion goes, there may be a need for a URL mix
> optimizer and for a fast crawler based on it. Is this something worth
> pursuing? MilleBii, what do you think?
>
> Mark
>
> On Wed, Nov 25, 2009 at 3:44 PM, MilleBii <mille...@gmail.com> wrote:
>
> > The logs show that my fetch queue is full and my 100 threads are mostly
> > spin-waiting towards the end.
> >
> > Now in the very last run (150k URLs) I can clearly see 4 phases:
> > + very high speed: 3 MB/s for a few minutes
> > + a sudden speed drop to around 1 MB/s, flat for several hours
> > + another speed drop to around 400 kB/s for several hours
> > + another speed drop to around 200 kB/s for a few hours, too.
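[Editor's note: the oscillating Fetcher status lines quoted above can be checked mechanically. The following is a minimal sketch, not part of the thread; the regex and the helper names `parse_status` / `all_spin_waiting` are made up here and match only the status-line format shown in the logs.]

```python
import re

# A sketch: parse Nutch Fetcher status lines like the ones quoted above.
# The regex matches only the "-activeThreads=..., spinWaiting=...,
# fetchQueues.totalSize=..." format shown in these logs.
STATUS_RE = re.compile(
    r"activeThreads=(\d+), spinWaiting=(\d+), fetchQueues\.totalSize=(\d+)"
)

def parse_status(line):
    """Return (activeThreads, spinWaiting, totalSize), or None for other lines."""
    m = STATUS_RE.search(line)
    return tuple(int(g) for g in m.groups()) if m else None

def all_spin_waiting(line):
    """True when every active thread is spin-waiting, i.e. the fetcher is
    throttled by politeness delays or queue starvation, not by bandwidth."""
    parsed = parse_status(line)
    return parsed is not None and parsed[0] == parsed[1]
```

When `all_spin_waiting` is true for a long stretch, as in the logs above (activeThreads=100, spinWaiting=100), the bottleneck is not network throughput but the per-host queues.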
> >
> > So probably it is just a consequence of the URL mix, which isn't that good.
> > Note: I have limited the crawl to 1000 URLs per host, and there are about
> > 20-30 hosts in the mix which get limited that way.
> >
> > Maybe a better mix of URLs is possible?
> >
> > 2009/11/25 Julien Nioche <lists.digitalpeb...@gmail.com>
> >
> > > Or is it stuck on a couple of hosts which time out? The logs should have a
> > > trace with the number of active threads, which should give some indication
> > > of what's happening.
> > >
> > > Julien
> > >
> > > 2009/11/25 Dennis Kubes <ku...@apache.org>
> > >
> > > > If it is waiting and the box is idle, my first thought is that it is not
> > > > DNS. I just put that up as one of the things people will run into. Most
> > > > likely it is an uneven distribution of URLs or something like that.
> > > >
> > > > Dennis
> > > >
> > > > MilleBii wrote:
> > > >
> > > >> I get your point... although I thought a high number of threads would do
> > > >> exactly the same. Maybe I am missing something.
> > > >>
> > > >> During my fetcher runs, used bandwidth gets low pretty quickly, disk
> > > >> I/O is low, the CPU is low... So it must be waiting for something, but
> > > >> what?
> > > >>
> > > >> Could it be the DNS cache, which is full, so any new request gets
> > > >> forwarded to the master DNS of my ISP? Any idea how to check that? I'm
> > > >> not familiar with BIND myself... What is the typical rate you can get,
> > > >> how many DNS requests/s?
> > > >>
> > > >> 2009/11/25, Dennis Kubes <ku...@apache.org>:
> > > >>
> > > >>> It is not about the local DNS caching as much as having local DNS
> > > >>> servers. Too many fetchers hitting a centralized DNS server can act as
> > > >>> a DoS attack and slow down the entire fetching system.
> > > >>>
> > > >>> For example, say I have a single centralized DNS server for my network.
> > > >>> And say I have 2 map tasks per machine, 50 machines, 20 threads per
> > > >>> task. That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility
> > > >>> of 2000 DNS requests/sec. Most local DNS servers for smaller networks
> > > >>> can't handle that. If everything is hitting a centralized DNS and that
> > > >>> DNS takes 1-3 sec per request because of too many requests, the entire
> > > >>> fetching system stalls.
> > > >>>
> > > >>> Hitting a secondary larger cache, such as OpenDNS, can have an effect
> > > >>> because you are making one hop to get the name versus multiple hops to
> > > >>> root servers and then domain servers.
> > > >>>
> > > >>> Working off of a single server, these issues don't show up as much
> > > >>> because there aren't enough fetchers.
> > > >>>
> > > >>> Dennis Kubes
> > > >>>
> > > >>> MilleBii wrote:
> > > >>>
> > > >>>> Why would local DNS caching work? It only helps if you are
> > > >>>> going to crawl the same sites often... in which case you are hit by
> > > >>>> politeness.
> > > >>>>
> > > >>>> If you have segments with only/mainly different sites, it is not
> > > >>>> really going to help.
> > > >>>>
> > > >>>> So far I have not seen my quad core + 100 Mb/s + pseudo-distributed
> > > >>>> Hadoop going faster than 10 fetches/s... Let me check the DNS and I
> > > >>>> will tell you.
> > > >>>>
> > > >>>> I vote for 100 fetches/s, not sure how to get it though.
> > > >>>>
> > > >>>> 2009/11/24, Dennis Kubes <ku...@apache.org>:
> > > >>>>
> > > >>>>> Hi Mark,
> > > >>>>>
> > > >>>>> I just put this up on the wiki.
> > > >>>>> Hope it helps:
> > > >>>>>
> > > >>>>> http://wiki.apache.org/nutch/OptimizingCrawls
> > > >>>>>
> > > >>>>> Dennis
> > > >>>>>
> > > >>>>> Mark Kerzner wrote:
> > > >>>>>
> > > >>>>>> Hi guys,
> > > >>>>>>
> > > >>>>>> my goal is to do my crawls at 100 fetches per second, observing, of
> > > >>>>>> course, polite crawling. But when the URLs are all from different
> > > >>>>>> domains, what theoretically would stop some software from
> > > >>>>>> downloading from 100 domains at once, achieving the desired speed?
> > > >>>>>>
> > > >>>>>> But whatever I do, I can't make Nutch crawl at that speed. Even if
> > > >>>>>> it starts at a few dozen URLs/second, it slows down at the end (as
> > > >>>>>> discussed by many, and by Krugler).
> > > >>>>>>
> > > >>>>>> Should I write something of my own, or are there fast crawlers?
> > > >>>>>>
> > > >>>>>> Thanks!
> > > >>>>>>
> > > >>>>>> Mark
> > > >>>>>>
> > >
> > > --
> > > DigitalPebble Ltd
> > > http://www.digitalpebble.com
> >
> >
> > --
> > -MilleBii-
>
>
> --
> -MilleBii-


--
-MilleBii-
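[Editor's note: Dennis's back-of-the-envelope DNS figure from the thread can be written out as a small helper. This is a sketch, not part of the thread; the function name and the worst case of one DNS lookup per fetcher per second are assumptions, while the figures are the ones from his email.]

```python
# A sketch of Dennis's DNS load estimate: in the worst case, every fetcher
# thread resolves a brand-new host each second, so each thread hits the
# resolver once per second.
def peak_dns_requests_per_sec(machines, map_tasks_per_machine, threads_per_task):
    """Upper bound on resolver load when no hostnames repeat (no cache hits)."""
    return machines * map_tasks_per_machine * threads_per_task

# The figures from the thread: 50 machines, 2 map tasks each, 20 threads per task.
fetchers = peak_dns_requests_per_sec(50, 2, 20)
print(fetchers)  # -> 2000, i.e. up to 2000 DNS requests per second
```

This also shows why, as MilleBii notes, a local cache only helps when hostnames repeat across the segment: with all-distinct hosts the cache hit rate is zero and the full 2000 req/s lands on the upstream resolver.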