Ah, look, look! https://issues.apache.org/jira/browse/NUTCH-629
I guess I did upload a patch aimed at this problem a while back. As you can see, it's not perfect, but it did help me some. The real fix will have to implement a combination of things, like I mentioned in the email below. Plus, ideally, Nutch would have a "Host DB" to keep track of hosts, as described in https://issues.apache.org/jira/browse/NUTCH-628 . Armed with that, the fetchlist generation could be smarter and try to generate a "more even" fetchlist.

By "more even" I mean that it would use what it knows about hosts (e.g. how fast they are, how many URLs they have in Crawl DB...) to generate a fetchlist that doesn't have a mix of very fast and very slow servers, and doesn't produce a fetchlist where some hosts have a large number of URLs and some have very few URLs. In other words, if we can get Nutch to produce a fetchlist with roughly the same number of URLs for each host, and one that includes only hosts with similar fetch speed in very recent history, then we would avoid the situation where fetching takes a long time and fetch speed drops.

Can anyone produce a patch based on this?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Otis Gospodnetic <[email protected]>
> To: [email protected]
> Sent: Wednesday, May 27, 2009 11:38:48 PM
> Subject: Re: threads get stuck in spinwaiting
>
> Ray,
>
> I don't think fetchlist generation sticks URLs from the same domain or host together. But URLs for the same host do end up in the same queue. This is by design and it is a good thing -- this is how Nutch can ensure not to hit the same host with more simultaneous threads than it should (typically 1 - Larsson85 - you really want to change that back to 1 as Ken described).
>
> It is normal and to be expected that at the end of the crawl you will end up with a number of URLs from a smaller number of hosts -- typically hosts that had more URLs in the fetchlist than other hosts, or hosts that are slower, so fetching from them takes a long time.
>
> Here is a visual example.
> We have 3 queues.
> Each queue holds URLs for a single host.
> Queue B is a queue for a host with the most URLs.
> Say that Queue A is a queue for a host that's fast (fetching from it is fast), and say that Queue C is a queue for a slow host.
>
> Queue A: A A A A A A A A A A A A A A
> Queue B: B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B
> Queue C: C C C C C C C C C
>
> So let's see what happens. Say that fetcher has 3 threads (tA, tB, tC), each thread in charge of one queue.
> Since Queue A is for a fast host, and it's not the biggest, tA will be done with Queue A first. When that happens, when tA is all done with Queue A, tB will still be working on URLs from Queue B, because that queue was much bigger. And tC might still be working on URLs from Queue C because that host was slow (for example, it could be that fetching a single URL from C takes 30 seconds, while URLs from A are fetched at 1 per second).
>
> So what happens with the download speed? It goes down because of the slow C.
> And it goes down because the Fetcher is not fetching at all times - it has to obey the delay between fetches in order to remain polite.
> And what happens to the whole fetch run? It takes foreeeeeeeeeever because of the above.
>
> So what I believe Ken did in his fetcher is that he simply specifies how much time he wants the whole fetch run to last. So if C is super slow, it doesn't matter. At some point the fetch run will end, and whatever URLs from C did not get fetched will have to wait until the next fetch run. And the same with B.
> Plus, I think he also said he can make the fetcher sleep less between requests, which can help the fetcher go through more of those B URLs.
>
> And I think this is really the whole story.
>
> So I think some possible solutions are:
> 1. introduce a notion of a timed fetch run as described above
> 2. keep reducing the sleep time between requests at some point, again as described above
> 3. consider measuring fetch speed per host and cutting off slow hosts (I think I had that once in my local Nutch copy, but it looks like I no longer have it)
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Raymond Balmès
> > To: [email protected]
> > Sent: Wednesday, May 27, 2009 9:43:02 AM
> > Subject: Re: threads get stuck in spinwaiting
> >
> > maybe the problem is not in the fetcher but rather in the generate fetch list phase where it should take care in not sticking all URLs to the same domain together.
> >
> > -Ray-
> >
> > 2009/5/27 Larsson85
> >
> > >
> > > You're probably right that it has something to do with the politeness. I didn't notice it before, but now when you mention it I can see that all the pages it's fetching at the end of the crawl are from the same domain. Is there any way to turn off the politeness, or perhaps make it less polite to speed things up? I've been doing a test run today, and the result is that it has been stuck in this spinwaiting state for about 3 hours, which is not acceptable.
> > >
> > > Perhaps it is that I'm using too small a url-list to start with. I'm using the dmoz list from the nutch tutorial, and I have a filter on .se and .nu domains which probably disqualifies a lot of the urls in the list. Any tip on where to get a bigger list? And most important, any tip on how I can turn off the politeness, or at least make it less polite.
> > > Thanks for all the help.
> > >
> > >
> > > Raymond Balmès wrote:
> > > >
> > > > Observing what my crawls do, I believe Ken must be right.
> > > > Towards the end of the crawl (when the fetchqueues.totalSize="xxxx" counts down) in some cases I'm only fetching on two sites roughly, so indeed the politeness starts to play a role there, at least it should.
> > > >
> > > > -Ray-
> > > >
> > > > 2009/5/26 Raymond Balmès
> > > >
> > > >> Please read this too :
> > > >>
> > > >> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
> > > >>
> > > >> Interesting build from ken.
> > > >>
> > > >> 2009/5/26 Raymond Balmès
> > > >>
> > > >> yes already reported in multiple-threads.
> > > >>> I noted that if one does a "recrawl" you don't get this behavior... no idea why.
> > > >>>
> > > >>> -Raymond-
> > > >>>
> > > >>> 2009/5/26 Larsson85
> > > >>>
> > > >>>
> > > >>>> When I try to do my crawl it seems like the threads get stuck in some spinwaiting mode. At first the crawl goes as planned, and I couldn't be happier. But after some time, it starts reporting more of these spinwaiting messages.
> > > >>>>
> > > >>>> I print a log here to show you what it looks like. As you can see it gets stuck, and the queue decreases by 1 all the time. I've tried doing a smaller crawl, and what happens is that it counts down until the "fetchQueues.totalSize" reaches 0, and then the crawl is done.
> > > >>>>
> > > >>>> But the problem is that this countdown is very slow, there's no effective crawling going on, not using either bandwidth or CPU power. Basically, this costs way too much time, I can't let it go on like this for hours to be done.
> > > >>>> How can I fix this?
> > > >>>>
> > > >>>>
> > > >>>> after about an hour of crawling this is what the log looks like
> > > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > > >>>> - fetching http://home.swipnet.se/~w-147200/
> > > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > > >>>> - fetching http://biphome.spray.se/alarsson/
> > > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > > >>>> - fetching http://home.swipnet.se/~w-31853/html/
> > > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
> > > >>>>
> > > >>>> ....
> > > >>>>
> > > >>>> --
> > > >>>> View this message in context:
> > > >>>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
> > > >>>> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > > >
> > >
> > > --
> > > View this message in context:
> > > http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23742537.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> > >
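
P.S. To make the "more even" fetchlist idea from the top of this message a bit more concrete, here is a rough sketch in plain Python. It is not Nutch code and not the NUTCH-629 patch. The function name, the `host_speed` dict standing in for a Host DB, and both thresholds are assumptions made up for illustration. It keeps only hosts whose recent fetch time is close to the median, then caps how many URLs each host may contribute.

```python
from collections import defaultdict
from statistics import median
from urllib.parse import urlparse


def even_fetchlist(urls, host_speed, max_per_host=100, speed_ratio=3.0):
    """Select URLs so that hosts have similar speed and similar URL counts.

    urls        -- candidate URLs (hypothetical stand-in for Crawl DB entries)
    host_speed  -- hypothetical "Host DB": host -> recent seconds per fetch
    max_per_host, speed_ratio -- illustrative thresholds, not Nutch settings
    """
    # Group candidate URLs by host, much like the fetcher's per-host queues.
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlparse(url).hostname].append(url)

    # Hosts without a recorded speed cannot be compared, so skip them here.
    known = {h: host_speed[h] for h in by_host if h in host_speed}
    if not known:
        return []

    # Keep hosts whose fetch time is within speed_ratio of the median.
    mid = median(known.values())
    similar = [h for h, secs in known.items()
               if mid / speed_ratio <= secs <= mid * speed_ratio]

    # Cap each remaining host so no single queue dominates the run.
    fetchlist = []
    for host in similar:
        fetchlist.extend(by_host[host][:max_per_host])
    return fetchlist


urls = ([f"http://a.example/{i}" for i in range(5)]
        + [f"http://b.example/{i}" for i in range(10)]
        + [f"http://c.example/{i}" for i in range(4)])
speeds = {"a.example": 1.0, "b.example": 2.0, "c.example": 30.0}
selected = even_fetchlist(urls, speeds, max_per_host=3)
```

With these made-up numbers the slow host c.example is left for a later run, and a.example and b.example contribute three URLs each. A real version would also need to decide what to do with hosts that have no speed history yet.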
