I have to say that I'm still puzzled. Here is the latest: I just restarted a
run, and guess what:

I got very high speed: 8 Mbit/s sustained for 1 hour, where I could only get
3 Mbit/s max before (note: bits, not bytes as I said before).
A few samples show that I was running at 50 fetches/sec... not bad. But I
haven't the faintest idea why this run was so fast.
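
As a rough sanity check on those two figures, here is a small Python sketch of
the average page size they imply. It assumes the 8 Mbit/s and 50 fetches/sec
samples describe the same period and that "M" means 10^6; the function name is
only for illustration.

```python
def avg_bytes_per_fetch(mbit_per_s, fetches_per_s):
    """Average payload per fetch implied by a bandwidth and a fetch rate."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8  # convert bits/s to bytes/s
    return bytes_per_s / fetches_per_s


# 8 Mbit/s at 50 fetches/s works out to about 20 kB per fetched page.
print(avg_bytes_per_fetch(8, 50))  # 20000.0
```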


Then it drops and I get logs like these:

2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516

I don't fully understand why it is oscillating between two queue sizes, but
never mind... It is likely the end of the run, since Hadoop shows 99.99%
complete for the 2 maps it generated.
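
To look at this more systematically, the status lines can be parsed and
tallied. The sketch below is only an illustration: the regular expression is
written against the five lines pasted above (tolerating the mail client's line
wrap), and the names `STATUS` and `parse_status` are made up here. One possible
reading of the two alternating values, which is a guess and not something the
logs confirm, is that each of the 2 map tasks reports its own queue once per
second.

```python
import re
from collections import Counter

# The five status lines from above, including the wrapped line breaks.
LOG = """\
2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
"""

# \s* between fields also matches the newline introduced by wrapping.
STATUS = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO\s+fetcher\.Fetcher - "
    r"-activeThreads=(\d+),\s*spinWaiting=(\d+),\s*fetchQueues\.totalSize=(\d+)"
)


def parse_status(text):
    """Return (timestamp, active, spinning, queue_size) tuples."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in STATUS.finditer(text)
    ]


rows = parse_status(LOG)
queue_sizes = Counter(size for _, _, _, size in rows)
all_spinning = all(active == spin for _, active, spin, _ in rows)

print(queue_sizes)   # how often each queue size was reported
print(all_spinning)  # True when every active thread is spin-waiting
```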

Could that be explained by a better URL mix?
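
One way to reason about the URL-mix question: if each host is fetched one
request at a time with a politeness delay between requests, the host with the
most URLs sets a floor on how long the run takes, however many threads are
idle. A minimal sketch under those assumptions follows. The 5-second delay and
the host distribution are hypothetical, chosen only to match the 1000 URLs per
host cap and the roughly 150k URLs mentioned in this thread.

```python
def min_crawl_seconds(urls_per_host, delay_s):
    """Lower bound on run time when each host is fetched serially
    with a fixed delay between requests."""
    return max(urls_per_host.values()) * delay_s


def ideal_fetch_rate(urls_per_host, delay_s):
    """Best average fetches/s achievable for this mix under that bound."""
    total = sum(urls_per_host.values())
    return total / min_crawl_seconds(urls_per_host, delay_s)


# Hypothetical mix: 25 hosts at the 1000-URL cap, plus 125,000 single-URL hosts.
mix = {f"big{i}.example": 1000 for i in range(25)}
mix.update({f"small{i}.example": 1 for i in range(125_000)})

print(min_crawl_seconds(mix, 5))  # 5000 seconds, set by the capped hosts
print(ideal_fetch_rate(mix, 5))   # 30.0 fetches/s averaged over the run
```

Under these assumptions, once the single-URL hosts are done only the 25 capped
hosts remain, and together they can deliver at most 25 / 5 = 5 fetches/s, which
would look like a long slow tail with most threads spin-waiting.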

2009/11/25 Mark Kerzner <markkerz...@gmail.com>

> Judging by how this discussion goes, there may be a need for a URL mix
> optimizer and for a fast crawler based on that. Is this something worth
> pursuing? MilleBii, what do you think?
>
> Mark
>
> On Wed, Nov 25, 2009 at 3:44 PM, MilleBii <mille...@gmail.com> wrote:
>
> > The logs show that my fetch queue is full and my 100 threads are mostly
> > spin-waiting towards the end.
> >
> > Now in the very last run (150k URLs) I can clearly see 4 phases:
> > + very high speed: 3MB/s for a few minutes
> > + sudden speed drop to around 1MB/s, flat for several hours
> > + another speed drop to around 400kB/s for several hours
> > + another speed drop to around 200kB/s for a few hours too.
> >
> > So probably it is just a consequence of the URL mix, which isn't that good.
> > Note: I have limited to 1000 URLs per host, and there are about 20-30 hosts
> > in the mix which get limited that way.
> >
> > Maybe a better mix of URLs is possible?
> >
> > 2009/11/25 Julien Nioche <lists.digitalpeb...@gmail.com>
> >
> > > or it is stuck on a couple of hosts which time out? The logs should have a
> > > trace with the number of active threads, which should give some indication
> > > of what's happening.
> > >
> > > Julien
> > >
> > >
> > > 2009/11/25 Dennis Kubes <ku...@apache.org>
> > >
> > > > If it is waiting and the box is idle, my first thought is not DNS.  I just
> > > > put that up as one of the things people will run into.  Most likely it is
> > > > uneven distribution of urls or something like that.
> > > >
> > > > Dennis
> > > >
> > > >
> > > > MilleBii wrote:
> > > >
> > > > >> I get your point... although I thought a high number of threads would do
> > > > >> exactly the same. Maybe I'm missing something.
> > > >>
> > > > >> During my fetcher runs used bandwidth gets low pretty quickly, disk
> > > > >> I/O is low, the CPU is low... So it must be waiting for something, but
> > > > >> what?
> > > >>
> > > > >> It could be that the DNS cache is full and any new request gets forwarded
> > > > >> to the master DNS of my ISP.
> > > > >> Any idea how to check that? I'm not familiar with BIND myself... What
> > > > >> is the typical rate you can get, in DNS requests/s?
> > > >>
> > > >>
> > > >>
> > > >> 2009/11/25, Dennis Kubes <ku...@apache.org>:
> > > >>
> > > > >>> It is not about the local DNS caching as much as having local DNS
> > > > >>> servers.  Too many fetchers hitting a centralized DNS server can act as
> > > > >>> a DoS attack and slow down the entire fetching system.
> > > >>>
> > > > >>> For example, say I have a single centralized DNS server for my network.
> > > > >>> And say I have 2 map tasks per machine, 50 machines, 20 threads per task.
> > > > >>> That would be 50 * 2 * 20 = 2000 fetchers, meaning a possibility of
> > > > >>> 2000 DNS requests / sec.  Most local DNS servers for smaller networks
> > > > >>> can't handle that.  If everything is hitting a centralized DNS and that
> > > > >>> DNS takes 1-3 sec per request because of too many requests, the entire
> > > > >>> fetching system stalls.
> > > >>>
> > > > >>> Hitting a secondary larger cache, such as OpenDNS, can have an effect
> > > > >>> because you are making one hop to get the name versus multiple hops to
> > > > >>> root servers then domain servers.
> > > >>>
> > > >>> Working off of a single server these issues don't show up as much
> > > >>> because there aren't enough fetchers.
> > > >>>
> > > >>> Dennis Kubes
> > > >>>
> > > >>> MilleBii wrote:
> > > >>>
> > > > >>>> Why would local DNS caching work... It only works if you are
> > > > >>>> going to crawl the same site often... in which case you are hit by
> > > > >>>> the politeness.
> > > > >>>>
> > > > >>>> If you have segments with only/mainly different sites, it is not really
> > > > >>>> going to help.
> > > >>>>
> > > > >>>> So far I have not seen my quad core + 100mb/s + pseudo-distributed
> > > > >>>> Hadoop going faster than 10 fetches/s... Let me check the DNS and I
> > > > >>>> will tell you.
> > > > >>>>
> > > > >>>> I vote for 100 fetches/s, not sure how to get it though.
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> 2009/11/24, Dennis Kubes <ku...@apache.org>:
> > > >>>>
> > > >>>>> Hi Mark,
> > > >>>>>
> > > >>>>> I just put this up on the wiki.  Hope it helps:
> > > >>>>>
> > > >>>>> http://wiki.apache.org/nutch/OptimizingCrawls
> > > >>>>>
> > > >>>>> Dennis
> > > >>>>>
> > > >>>>>
> > > >>>>> Mark Kerzner wrote:
> > > >>>>>
> > > >>>>>> Hi, guys,
> > > >>>>>>
> > > > >>>>>> my goal is to do my crawls at 100 fetches per second, observing, of
> > > > >>>>>> course, polite crawling. But, when URLs are all different domains, what
> > > > >>>>>> theoretically would stop some software from downloading from 100
> > > > >>>>>> domains at once, achieving the desired speed?
> > > > >>>>>>
> > > > >>>>>> But, whatever I do, I can't make Nutch crawl at that speed. Even if it
> > > > >>>>>> starts at a few dozen URLs/second, it slows down at the end (as
> > > > >>>>>> discussed by many and by Krugler).
> > > > >>>>>>
> > > > >>>>>> Should I write something of my own, or are there fast crawlers?
> > > >>>>>>
> > > >>>>>> Thanks!
> > > >>>>>>
> > > >>>>>> Mark
> > > >>>>>>
> > > >>>>>>
> > > >>
> > >
> > >
> > > --
> > > DigitalPebble Ltd
> > > http://www.digitalpebble.com
> > >
> >
> >
> >
> > --
> > -MilleBii-
> >
>



-- 
-MilleBii-
