Ah, look, look!
https://issues.apache.org/jira/browse/NUTCH-629

I guess I did upload a patch aimed at this problem a while back.  As you can 
see, it's not perfect, but it did help me some.

The real fix will have to implement a combination of things, like I mentioned 
in the email below.
Plus, ideally, Nutch would have a "Host DB" to keep track of hosts, as 
described in https://issues.apache.org/jira/browse/NUTCH-628 .  Armed with 
that, the fetchlist generation could be smarter and try to generate a "more even" 
fetchlist.
By "more even" I mean that it would use what it knows about hosts (e.g. how 
fast they are, how many URLs they have in Crawl DB...) to generate a fetchlist 
that doesn't have a mix of very fast and very slow servers, and doesn't produce 
a fetchlist where some hosts have a large number of URLs and some have very few 
URLs.
In other words, if we can get Nutch to produce a fetchlist with roughly the 
same number of URLs for each host, and produce a fetchlist that includes only 
hosts with similar fetch speed in very recent history, then we would avoid the 
situation where fetching takes a long time and fetch speed drops.
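
To make the "more even" idea concrete, here is a minimal sketch in Python (not 
Nutch code, and not what the NUTCH-629 patch does).  It assumes a hypothetical 
Host DB exposed as a plain dict from host to recent average seconds per fetch; 
the function name and the max_urls_per_host and speed_factor parameters are 
made up for illustration.

```python
from collections import defaultdict
from statistics import median
from urllib.parse import urlparse


def generate_even_fetchlist(urls, avg_fetch_secs, max_urls_per_host=2, speed_factor=3.0):
    """Pick URLs so that hosts have similar recent speed and similar URL counts.

    urls: candidate URLs (stand-in for what would come out of the Crawl DB).
    avg_fetch_secs: hypothetical Host DB lookup, host -> recent seconds per fetch.
    """
    # Group candidate URLs by host, preserving their input order.
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).hostname].append(url)

    # Only hosts with known speed data can be compared against each other.
    known = {h: avg_fetch_secs[h] for h in by_host if h in avg_fetch_secs}
    if not known:
        return []

    typical = median(known.values())
    fetchlist = []
    for host in sorted(known):
        secs = known[host]
        # Keep only hosts whose speed is within speed_factor of the typical host.
        if typical / speed_factor <= secs <= typical * speed_factor:
            # Cap URLs per host so that no single queue dominates the run.
            fetchlist.extend(by_host[host][:max_urls_per_host])
    return fetchlist
```

With hosts like A (fast), B (many URLs) and C (very slow) from the example in 
the email below, C would be left for a later fetchlist and B would be capped, 
so the remaining queues should finish at roughly the same time.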

Can anyone produce a patch based on this?
 
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Otis Gospodnetic <[email protected]>
> To: [email protected]
> Sent: Wednesday, May 27, 2009 11:38:48 PM
> Subject: Re: threads get stuck in spinwaiting
> 
> 
> Ray,
> 
> I don't think fetchlist generation sticks URLs from the same domain or host 
> together.  But URLs for the same host do end up in the same queue.  This is by 
> design and it is a good thing -- this is how Nutch can ensure not to hit the 
> same host with more simultaneous threads than it should (typically 1 - 
> Larsson85 - you really want to change that back to 1 as Ken described).
> 
> It is normal and to be expected that at the end of the crawl you will end up 
> with a number of URLs from a smaller number of hosts -- typically hosts that had 
> more URLs in the fetchlist than other hosts, or hosts that are slower, so 
> fetching from them takes a long time.
> 
> Here is a visual example.
> We have 3 queues.
> Each queue holds URLs for a single host.
> Queue B is a queue for a host with the most URLs.
> Say that Queue A is a queue for a host that's fast (fetching from it is fast), 
> and say that Queue C is a queue for a slow host.
> 
> Queue A: A A A A A A A A A A A A A A
> Queue B: B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B
> Queue C: C C C C C C C C C
> 
> So let's see what happens.  Say that the fetcher has 3 threads (tA, tB, tC), 
> each thread in charge of one queue.
> Since Queue A is for a fast host, and it's not the biggest, tA will be done with 
> Queue A first.  When tA is all done with Queue A, tB will still be working on 
> URLs from Queue B, because that queue was much bigger.  And tC might still be 
> working on URLs from Queue C because that host was slow (for example, it could 
> be that fetching a single URL from C takes 30 seconds, while URLs from A are 
> fetched at 1 per second).
> 
> So what happens with the download speed?  It goes down because of the slow C.
> And it goes down because the Fetcher is not fetching at all times - it has to 
> obey the delay between fetches in order to remain polite.
> And what happens to the whole fetch run?  It takes foreeeeeeeeeever because of 
> the above.
> 
> So what I believe Ken did in his fetcher is that he simply specifies how much 
> time he wants the whole fetch run to last.  So if C is super slow, it doesn't 
> matter.  At some point the fetch run will end, and whatever URLs from C did not 
> get fetched will have to wait until the next fetch run.  And the same with B.
> Plus, I think he also said he can make the fetcher sleep less between requests, 
> which can help the fetcher go through more of those B URLs.
> 
> And I think this is really the whole story.
> 
> So I think some possible solutions are:
> 1. introduce a notion of a timed fetch run as described above
> 2. keep reducing the sleep time between requests at some point, again as 
> described above
> 3. consider measuring fetch speed per host and cutting off slow hosts (I think I 
> had that once in my local Nutch copy, but it looks like I no longer have it)
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
> > From: Raymond Balmès 
> > To: [email protected]
> > Sent: Wednesday, May 27, 2009 9:43:02 AM
> > Subject: Re: threads get stuck in spinwaiting
> > 
> > maybe the problem is not in the fetcher but rather in the generate fetch
> > list phase, where it should take care not to stick all URLs from the same
> > domain together.
> > 
> > -Ray-
> > 
> > 2009/5/27 Larsson85 
> > 
> > >
> > > You're probably right that it has something to do with the politeness. I
> > > didn't notice it before, but now that you mention it I can see that all the
> > > pages it's fetching at the end of the crawl are from the same domain. Is
> > > there any way to turn off the politeness, or perhaps make it less polite to
> > > speed things up? I've been doing a test run today, and the result is that it
> > > has been stuck in this spinwaiting state for about 3 hours, which is not
> > > acceptable.
> > >
> > > Perhaps it is that I'm using a too small url-list to start with. I'm using
> > > the dmoz list from the nutch tutorial, and I have a filter on .se and .nu
> > > domains which probably disqualifies a lot of the urls in the list. Any tip
> > > on where to get a bigger list? And most important, any tip on how I can turn
> > > off the politeness, or at least make it less polite.
> > > Thanks for all the help.
> > >
> > >
> > > Raymond Balmès wrote:
> > > >
> > > > Observing what my crawls do, I believe Ken must be right.
> > > > Towards the end of the crawl (when the fetchqueues.totalSize="xxxx" counts
> > > > down) in some cases I'm only fetching from roughly two sites, so indeed the
> > > > politeness starts to play a role there, or at least it should.
> > > >
> > > > -Ray-
> > > >
> > > > 2009/5/26 Raymond Balmès 
> > > >
> > > >> Please read this too :
> > > >>
> > > >>
> > > >> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
> > > >>
> > > >> Interesting build from ken.
> > > >>
> > > >> 2009/5/26 Raymond Balmès 
> > > >>
> > > >>  yes already reported in multiple-threads.
> > > >>> I noted that if one does a "recrawl" you don't get this behavior... no
> > > >>> idea why.
> > > >>>
> > > >>> -Raymond-
> > > >>>
> > > >>> 2009/5/26 Larsson85 
> > > >>>
> > > >>>
> > > >>>> When I try to do my crawl it seems like the threads get stuck in some
> > > >>>> spinwaiting mode. At first the crawl goes as planned, and I couldn't be
> > > >>>> happier. But after some time, it starts reporting more of these
> > > >>>> spinwaiting messages.
> > > >>>>
> > > >>>> I print a log here to show you what it looks like. As you can see it gets
> > > >>>> stuck, and the queue decreases by 1 all the time. I've tried doing a smaller
> > > >>>> crawl, and what happens is that it counts down until the
> > > >>>> "fetchQueues.totalSize" reaches 0, and then the crawl is done.
> > > >>>>
> > > >>>> But the problem is that this countdown is very slow, there's no effective
> > > >>>> crawling going on, not using either bandwidth or cpu power. Basically, this
> > > >>>> costs way too much time, I can't let it go on like this for hours to be done.
> > > >>>> How can I fix this?
> > > >>>>
> > > >>>>
> > > >>>> after about an hour of crawling this is what the log looks like
> > > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > > >>>>  - fetching http://home.swipnet.se/~w-147200/
> > > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > > >>>>  - fetching http://biphome.spray.se/alarsson/
> > > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > > >>>>  - fetching http://home.swipnet.se/~w-31853/html/
> > > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
> > > >>>>
> > > >>>> ....
> > > >>>>
> > > >>>> --
> > > >>>> View this message in context:
> > > >>>>
> > > >>>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
> > > >>>> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > > >
> > >
> > > --
> > > View this message in context:
> > > http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23742537.html
> > >  Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> > >
