Ray, I don't think fetchlist generation sticks URLs from the same domain or host together. But URLs for the same host do end up in the same queue. This is by design and it is a good thing -- this is how Nutch ensures it does not hit the same host with more simultaneous threads than it should (typically 1). Larsson85, you really want to change that back to 1, as Ken described.
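
To make the "same host, same queue" point concrete, here is a minimal sketch in Python. It is not Nutch's actual code: the function name `build_host_queues` is made up for illustration, and the sample URLs are the ones from the log quoted further down.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_host_queues(urls):
    """Group URLs into one list per host, keeping fetchlist order (hypothetical helper)."""
    queues = defaultdict(list)
    for url in urls:
        # All URLs that share a hostname land in the same queue.
        host = urlsplit(url).hostname or ""
        queues[host].append(url)
    return dict(queues)


fetchlist = [
    "http://home.swipnet.se/~w-147200/",
    "http://biphome.spray.se/alarsson/",
    "http://home.swipnet.se/~w-31853/html/",
]
queues = build_host_queues(fetchlist)
for host, host_urls in queues.items():
    print(host, len(host_urls))
```

With one thread allowed per host, a queue like this can only be drained one URL at a time, however many fetcher threads exist in total.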
It is normal and to be expected that at the end of the crawl you will end up with a number of URLs from a smaller number of hosts -- typically hosts that had more URLs in the fetchlist than other hosts, or hosts that are slower, so fetching from them takes a long time.

Here is a visual example. We have 3 queues. Each queue holds URLs for a single host. Queue B is the queue for the host with the most URLs. Say that Queue A is a queue for a host that's fast (fetching from it is fast), and say that Queue C is a queue for a slow host.

Queue A: A A A A A A A A A A A A A A
Queue B: B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B B
Queue C: C C C C C C C C C

So let's see what happens. Say that the fetcher has 3 threads (tA, tB, tC), each thread in charge of one queue. Since Queue A is for a fast host, and it's not the biggest, tA will be done with Queue A first. When tA is all done with Queue A, tB will still be working on URLs from Queue B, because that queue was much bigger. And tC might still be working on URLs from Queue C because that host is slow (for example, it could be that fetching a single URL from C takes 30 seconds, while URLs from A are fetched at 1 per second).

So what happens with the download speed? It goes down because of the slow C. And it goes down because the Fetcher is not fetching at all times - it has to obey the delay between fetches in order to remain polite. And what happens to the whole fetch run? It takes foreeeeeeeeeever because of the above.

So what I believe Ken did in his fetcher is that he simply specifies how much time he wants the whole fetch run to last. So if C is super slow, it doesn't matter. At some point the fetch run will end, and whatever URLs from C did not get fetched will have to wait until the next fetch run. And the same with B. Plus, I think he also said he can make the fetcher sleep less between requests, which can help the fetcher go through more of those B URLs.
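
The arithmetic behind the three-queue example can be sketched roughly as follows. This is a back-of-the-envelope model, not how Nutch's or Ken's fetcher is implemented. The queue sizes (14, 36, 9), B's fetch time, the 5 second delay, and the 120 second time budget are all assumptions chosen for illustration; only the 1 second (A) and 30 second (C) fetch times come from the example above. The function names are made up.

```python
def queue_time(n_urls, fetch_secs, delay_secs):
    """Seconds one thread needs to drain a single-host queue."""
    if n_urls == 0:
        return 0
    # n fetches, with a politeness pause between consecutive requests.
    return n_urls * fetch_secs + (n_urls - 1) * delay_secs


def fetched_before(deadline_secs, n_urls, fetch_secs, delay_secs):
    """URLs from one queue whose fetch completes within deadline_secs."""
    # The k-th fetch completes at k*fetch + (k-1)*delay, so
    # k <= (deadline + delay) / (fetch + delay).
    k = (deadline_secs + delay_secs) // (fetch_secs + delay_secs)
    return max(0, min(n_urls, int(k)))


# Hypothetical numbers: (URL count, seconds per fetch) for each host.
queues = {"A": (14, 1), "B": (36, 1), "C": (9, 30)}
DELAY = 5       # assumed pause between requests to one host, in seconds
DEADLINE = 120  # assumed time budget for the whole fetch run, in seconds

# Untimed run: one thread per queue, so the slowest queue decides the total.
per_queue = {h: queue_time(n, f, DELAY) for h, (n, f) in queues.items()}
run_time = max(per_queue.values())

# Timed run: count what each queue gets through before the deadline,
# first with the assumed delay, then with a shorter 2 second delay.
timed = {h: fetched_before(DEADLINE, n, f, DELAY) for h, (n, f) in queues.items()}
less_sleep = {h: fetched_before(DEADLINE, n, f, 2) for h, (n, f) in queues.items()}

print(per_queue, run_time)
print(timed, less_sleep)
```

Under these assumed numbers the untimed run lasts as long as queue C does, even though A finished long before. With the time budget, the run stops on schedule and the leftover B and C URLs wait for the next run; shortening the delay mostly helps the big queue B, while slow C barely changes.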
And I think this is really the whole story.

So I think some possible solutions are:
1. introduce a notion of a timed fetch run as described above
2. keep reducing the sleep time between requests at some point, again as described above
3. consider measuring fetch speed per host and cutting off slow hosts (I think I had that once in my local Nutch copy, but it looks like I no longer have it)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Raymond Balmès <[email protected]>
> To: [email protected]
> Sent: Wednesday, May 27, 2009 9:43:02 AM
> Subject: Re: threads get stuck in spinwaiting
>
> maybe the problem is not in the fetcher but rather in the generate fetch
> list phase where it should take care in not sticking all URLs to the same
> domain together.
>
> -Ray-
>
> 2009/5/27 Larsson85
>
> >
> > You're probably right that it has something to do with the politeness. I
> > didn't notice it before, but now when you mention it I can see that all the
> > pages it's fetching at the end of the crawl is from the same domain. Is
> > there any way to turn of the politeness, or perhaps make it less polite to
> > speed things up? I've been doing a test run today, and the result is that it
> > has been stuck in this spinwaiting state for about 3 hours, which is not
> > acceptable.
> >
> > Perhaps it is that I'm using a to small url-list to start with. I'm using
> > the dmoz list from the nutch tutorial, and I have a filter on .se and .nu
> > domains which probably disqualifies a lot of the urls in the list. Any tip
> > on where to get a bigger list? And most important, any tip on how I can turn
> > off the politeness, or atleast make it less polite.
> > Thanks for all the help.
> >
> >
> > Raymond Balmès wrote:
> > >
> > > Observing what my crawls do, I believe Ken must be right.
> > > Towards the end of the crawl (when the fetchqueues.totalSize="xxxx" counts
> > > down) in some cases I'm only fetching on two sites roughly , so indeed the
> > > politeness starts to play a role there at least it should.
> > >
> > > -Ray-
> > >
> > > 2009/5/26 Raymond Balmès
> > >
> > >> Please read this too :
> > >>
> > >> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
> > >>
> > >> Interesting build from ken.
> > >>
> > >> 2009/5/26 Raymond Balmès
> > >>
> > >> yes already reported in multiple-threads.
> > >>> I noted that if one does a "recrawl" you don't get this behavior... no
> > >>> idea why.
> > >>>
> > >>> -Raymond-
> > >>>
> > >>> 2009/5/26 Larsson85
> > >>>
> > >>>
> > >>>> When I try to do my crawl it seems like the threads get stuck in som
> > >>>> spinwaiting mode. At first the crawl goes as planned, and I couldnt be
> > >>>> happier. But after som time, it starts reporting more of these
> > >>>> spinwaiting messages.
> > >>>>
> > >>>> I print a log here to show you what it looks like. As you can see it gets
> > >>>> stuck, and the queue decrease by 1 all the time. I've tried doing a smaller
> > >>>> crawl, and what happends is that it counts down untill the
> > >>>> "fetchQueues.totalSize" reaches 0, and then the crawl is done.
> > >>>>
> > >>>> But the problem is that this countdown is very slow,there's no effective
> > >>>> crawling going on, not using eather bandwith or cpu power. Basicly, this
> > >>>> costs way to much time, I cant let it go on like this for hours to be done.
> > >>>> How can I fix this?
> > >>>>
> > >>>>
> > >>>> after about an hour of crawling this is what the log looks like
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>>> - fetching http://home.swipnet.se/~w-147200/
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>>> - fetching http://biphome.spray.se/alarsson/
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>> - fetching http://home.swipnet.se/~w-31853/html/
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
> > >>>>
> > >>>> ....
> > >>>>
> > >>>> --
> > >>>> View this message in context:
> > >>>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
> > >>>> Sent from the Nutch - User mailing list archive at Nabble.com.
> > >>>>
> > >>>>
> > >>>
> > >>
> > >
> > >
> >
> > --
> > View this message in context:
> > http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23742537.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
