See https://issues.apache.org/jira/browse/NUTCH-570 for something relevant.
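
For a rough sense of why the tail of the crawl is so slow, here is a minimal sketch of the arithmetic, not Nutch code. It assumes the fetcher hits each host at most once per politeness delay; the function name tail_seconds and the 5 second delay are assumptions for illustration (in Nutch the per-host delay is set in nutch-site.xml, via fetcher.server.delay as far as I know).

```python
from collections import Counter
from urllib.parse import urlparse


def tail_seconds(urls, delay_seconds):
    """Lower bound on the time needed to fetch `urls` politely.

    Assumes one request per host at a time, with `delay_seconds` between
    successive requests to the same host, and ignores download time.
    """
    per_host = Counter(urlparse(u).netloc for u in urls)
    if not per_host:
        return 0.0
    # Hosts are fetched in parallel, so the longest per-host queue sets the
    # pace. The first request needs no wait, the remaining ones each do.
    return (max(per_host.values()) - 1) * delay_seconds


# The three URLs visible in the log: two share a host, one does not.
remaining = [
    "http://home.swipnet.se/~w-147200/",
    "http://biphome.spray.se/alarsson/",
    "http://home.swipnet.se/~w-31853/html/",
]
print(tail_seconds(remaining, 5.0))
```

Under those assumptions the number of fetcher threads does not appear in the result at all: once the remaining URLs sit on only a few hosts, extra threads just spin-wait.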

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Raymond Balmès <[email protected]>
> To: [email protected]
> Sent: Wednesday, May 27, 2009 9:43:02 AM
> Subject: Re: threads get stuck in spinwaiting
> 
> Maybe the problem is not in the fetcher but rather in the generate (fetch
> list) phase, which should take care not to put all URLs from the same
> domain together.
> 
> -Ray-
> 
> 2009/5/27 Larsson85 
> 
> >
> > You're probably right that it has something to do with the politeness. I
> > didn't notice it before, but now that you mention it I can see that all the
> > pages it's fetching at the end of the crawl are from the same domain. Is
> > there any way to turn off the politeness, or perhaps make it less polite to
> > speed things up? I've been doing a test run today, and the result is that
> > it has been stuck in this spinwaiting state for about 3 hours, which is not
> > acceptable.
> >
> > Perhaps it is that I'm using too small a URL list to start with. I'm using
> > the dmoz list from the Nutch tutorial, and I have a filter on .se and .nu
> > domains which probably disqualifies a lot of the URLs in the list. Any tip
> > on where to get a bigger list? And most important, any tip on how I can
> > turn off the politeness, or at least make it less polite?
> > Thanks for all the help.
> >
> >
> > Raymond Balmès wrote:
> > >
> > > Observing what my crawls do, I believe Ken must be right.
> > > Towards the end of the crawl (when the fetchQueues.totalSize="xxxx"
> > > count goes down), in some cases I'm fetching from only about two sites,
> > > so the politeness does start to play a role there, or at least it should.
> > >
> > > -Ray-
> > >
> > > 2009/5/26 Raymond Balmès 
> > >
> > >> Please read this too:
> > >>
> > >> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
> > >>
> > >> Interesting build from Ken.
> > >>
> > >> 2009/5/26 Raymond Balmès 
> > >>
> > >>> Yes, already reported in multiple threads.
> > >>> I noted that if one does a "recrawl" you don't get this behavior... no
> > >>> idea why.
> > >>>
> > >>> -Raymond-
> > >>>
> > >>> 2009/5/26 Larsson85 
> > >>>
> > >>>
> > >>>> When I try to do my crawl, it seems like the threads get stuck in some
> > >>>> spinwaiting mode. At first the crawl goes as planned, and I couldn't be
> > >>>> happier. But after some time, it starts reporting more of these
> > >>>> spinwaiting messages.
> > >>>>
> > >>>> I print a log here to show you what it looks like. As you can see, it
> > >>>> gets stuck, and the queue decreases by 1 all the time. I've tried doing
> > >>>> a smaller crawl, and what happens is that it counts down until
> > >>>> "fetchQueues.totalSize" reaches 0, and then the crawl is done.
> > >>>>
> > >>>> But the problem is that this countdown is very slow. There's no
> > >>>> effective crawling going on, and it isn't using either bandwidth or CPU
> > >>>> power. Basically, this costs way too much time; I can't let it go on
> > >>>> like this for hours to be done. How can I fix this?
> > >>>>
> > >>>>
> > >>>> After about an hour of crawling, this is what the log looks like:
> > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2526
> > >>>>  - fetching http://home.swipnet.se/~w-147200/
> > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2525
> > >>>>  - fetching http://biphome.spray.se/alarsson/
> > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2524
> > >>>>  - fetching http://home.swipnet.se/~w-31853/html/
> > >>>>  -activeThreads=1000, spinWaiting=1000, fetchQueues.totalSize=2523
> > >>>>
> > >>>> ....
> > >>>>
> > >>>> --
> > >>>> View this message in context:
> > >>>>
> > >>>> http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23723825.html
> > >>>> Sent from the Nutch - User mailing list archive at Nabble.com.
> > >>>>
> > >>>>
> > >>>
> > >>
> > >
> > >
> >
> > --
> > View this message in context:
> > http://www.nabble.com/threads-get-stuck-in-spinwaiting-tp23723825p23742537.html
> >  Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> >
