See https://issues.apache.org/jira/browse/NUTCH-570 for something relevant.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Raymond Balmès <[email protected]>
> To: [email protected]
> Sent: Wednesday, May 27, 2009 9:43:02 AM
> Subject: Re: threads get stuck in spinwaiting
>
> Maybe the problem is not in the fetcher but rather in the generate fetch
> list phase, which should take care not to stick all URLs from the same
> domain together.
>
> -Ray-
>
> 2009/5/27 Larsson85
>
> >
> > You're probably right that it has something to do with the politeness. I
> > didn't notice it before, but now that you mention it I can see that all the
> > pages it's fetching at the end of the crawl are from the same domain. Is
> > there any way to turn off the politeness, or perhaps make it less polite to
> > speed things up? I've been doing a test run today, and the result is that it
> > has been stuck in this spinwaiting state for about 3 hours, which is not
> > acceptable.
> >
> > Perhaps it is that I'm using too small a url-list to start with. I'm using
> > the dmoz list from the nutch tutorial, and I have a filter on .se and .nu
> > domains which probably disqualifies a lot of the urls in the list. Any tip
> > on where to get a bigger list? And most important, any tip on how I can turn
> > off the politeness, or at least make it less polite?
> > Thanks for all the help.
> >
> >
> > Raymond Balmès wrote:
> > >
> > > Observing what my crawls do, I believe Ken must be right.
> > > Towards the end of the crawl (when the fetchqueues.totalSize="xxxx" counts
> > > down) in some cases I'm only fetching from roughly two sites, so indeed
> > > the politeness starts to play a role there, or at least it should.
> > >
> > > -Ray-
> > >
> > > 2009/5/26 Raymond Balmès
> > >
> > >> Please read this too:
> > >>
> > >> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
> > >>
> > >> Interesting build from Ken.
> > >>
> > >> 2009/5/26 Raymond Balmès
> > >>
> > >> Yes, already reported in multiple threads.
> > >>> I noted that if one does a "recrawl" you don't get this behavior... no
> > >>> idea why.
> > >>>
> > >>> -Raymond-
> > >>>
> > >>> 2009/5/26 Larsson85
> > >>>
> > >>>
> > >>>> When I try to do my crawl it seems like the threads get stuck in some
> > >>>> spinwaiting mode. At first the crawl goes as planned, and I couldn't be
> > >>>> happier. But after some time, it starts reporting more of these
> > >>>> spinwaiting messages.
> > >>>>
> > >>>> I print a log here to show you what it looks like. As you can see it
> > >>>> gets stuck, and the queue decreases by 1 all the time. I've tried doing
> > >>>> a smaller crawl, and what happens is that it counts down until the
> > >>>> "fetchQueues.totalSize" reaches 0, and then the crawl is done.
> > >>>>
> > >>>> But the problem is that this countdown is very slow; there's no
> > >>>> effective crawling going on, and it is not using either bandwidth or
> > >>>> CPU power. Basically, this costs way too much time; I can't let it go
> > >>>> on like this for hours to be done.
> > >>>> How can I fix this?
> > >>>>
> > >>>>
> > >>>> after about an hour of crawling this is what the log looks like
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>>> - fetching http://home.swipnet.se/~w-147200/
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>>> - fetching http://biphome.spray.se/alarsson/
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>> - fetching http://home.swipnet.se/~w-31853/html/
> > >>>> -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
> > >>>>
> > >>>> ....
> > >>>>
> > >>>> --
> > >>>> View this message in context:
> > >>>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
> > >>>> Sent from the Nutch - User mailing list archive at Nabble.com.
> > >>>>
> > >>>>
> > >>>
> > >>
> > >
> > >
> >
> > --
> > View this message in context:
> > http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23742537.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
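
For what it's worth, here is a rough back-of-the-envelope sketch of why the countdown in the quoted log is so slow once only one or two hosts are left in the queue. It is not Nutch code. It assumes a per-host politeness delay of 5 seconds and one fetch at a time per host (I believe these are the defaults for fetcher.server.delay and fetcher.threads.per.host, but please check your own nutch-default.xml). The function name estimate_drain_seconds and the example URL list are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def estimate_drain_seconds(urls, delay_s=5.0, threads_per_host=1):
    """Rough lower bound on the time needed to fetch `urls` politely.

    Assumption (not taken from Nutch source): every host is served
    independently, with at most `threads_per_host` requests in flight and a
    pause of `delay_s` seconds between successive requests to that host.
    Download time itself is ignored, so real crawls take longer.
    """
    per_host = Counter(urlsplit(u).hostname for u in urls)
    if not per_host:
        return 0.0
    # The busiest host dictates the finish time, no matter how many
    # fetcher threads are idle (spin-waiting) in the meantime.
    return max(n / threads_per_host * delay_s for n in per_host.values())


# Hypothetical worst case: all 2526 remaining URLs sit on a single host.
remaining = ["http://home.swipnet.se/page%d" % i for i in range(2526)]
print(estimate_drain_seconds(remaining))  # 12630.0 seconds, about 3.5 hours
```

Under those assumptions, 1000 threads do not help: the bound depends only on the busiest host and the delay. That is in line with the roughly 3 hours of spinwaiting reported above. Lowering the delay, or limiting how many URLs per host go into one fetch list at generate time, shrinks the bound proportionally.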
