> 
>   I know a lot of people have seen this problem, but I have not
> run into it.  I ran a crawl of about 100m pages back in August
> with good luck.
> 
>   On a two-Xeon box with ~2 gigs of RAM, I would run a fetcher of
> 200 threads.  As Doug says, it took a little while to get up to
> speed.  But after 30 mins of operation, I got sustained 80 pages/sec
> until completion.
> 
>   One thing I did was turn off the PDF parsing.  As mentioned
> before, there was a bug in the PDF parser we used, which often
> caused threads to hang.
> 
>   Next time you see this problem, get a stack trace of the VM.
> (kill -3 <pid> on Unix).  See if there are threads stuck in the
> pdfbox code.
> 
>   --Mike
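
[Editor's note, inline] The check Mike describes, dumping the VM's threads and looking for pdfbox frames, can also be done from inside the JVM. The following is only a minimal sketch, not part of Nutch: the class name StuckThreadFinder and the "pdfbox" substring match are assumptions made for illustration, and Thread.getAllStackTraces() needs Java 5 or later. A thread that shows up here is merely inside matching code at that instant; it is only likely to be stuck if it keeps showing up across several samples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch.
class StuckThreadFinder {

    /**
     * Returns the names of live threads that currently have at least one
     * stack frame whose class name contains the given marker.
     */
    static List<String> threadsIn(String marker) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().contains(marker)) {
                    names.add(entry.getKey().getName());
                    break; // one matching frame is enough for this thread
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // "pdfbox" is an assumed marker; pass the real package name if it differs.
        String marker = args.length > 0 ? args[0] : "pdfbox";
        for (String name : threadsIn(marker)) {
            System.out.println("thread inside " + marker + " code: " + name);
        }
    }
}
```

Calling threadsIn() periodically from a monitoring thread and comparing the results over time gives roughly the same information as repeated kill -3 dumps.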
> 
> 
> 
> On Thu, 2004-09-16 at 15:32, Sandhu, Jagdeep wrote:
> > The problem of Fetcher threads hanging happens quite often. In small crawlers (not 
> > Nutch) that I have written, I used to sporadically experience a thread hang 
> > waiting for a socket read. I solved the problem using the setSoTimeout() method of
> > the Socket class. This is not an elegant programming solution :(
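
[Editor's note, inline] For readers unfamiliar with the setSoTimeout() workaround, here is a small sketch of it. It is not Jagdeep's code or Nutch's: the class ReadTimeoutDemo, the TIMED_OUT sentinel and the 200 ms value are all made up for illustration. With a timeout set, a blocked read() throws SocketTimeoutException instead of hanging the thread forever. The sketch uses try-with-resources, so it needs Java 7 or later.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of Nutch.
class ReadTimeoutDemo {
    static final int TIMED_OUT = -2; // sentinel chosen for this sketch

    /** Reads one byte, giving up after timeoutMillis instead of blocking forever. */
    static int readWithTimeout(String host, int port, int timeoutMillis)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Bound the connect as well as the read.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);
            try {
                return socket.getInputStream().read();
            } catch (SocketTimeoutException e) {
                return TIMED_OUT; // the peer sent nothing in time
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A local server that accepts connections in its backlog but never writes.
        try (ServerSocket silent =
                new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            int result = readWithTimeout(
                    silent.getInetAddress().getHostAddress(),
                    silent.getLocalPort(), 200);
            System.out.println(result == TIMED_OUT
                    ? "read timed out" : "read returned " + result);
        }
    }
}
```

Note that the timeout applies to each individual read() call, so a server that trickles one byte at a time can still hold a thread for a long while.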
> > 
> > The more elegant method would be to use NIO socket channels. I will try to propose
> > a Fetcher refactoring using NIO and the concurrent package in JDK 1.5 in the next
> > couple of months.
> > 
> > --Jagdeep
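
[Editor's note, inline] No refactoring proposal appears in this thread, so the following is not Jagdeep's design. It is only a sketch of one thing the JDK 1.5 concurrent package makes possible: bounding a task (a parse, say) with Future.get(timeout) so that a hang costs one skipped document rather than a fetcher thread. It does not use NIO. The names BoundedTask and callWithTimeout and the "skipped" fallback are hypothetical, and the lambdas need Java 8 or later.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch.
class BoundedTask {

    /** Runs task on pool; returns fallback if it fails or exceeds timeoutMillis. */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> task,
                                 long timeoutMillis, T fallback)
            throws InterruptedException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker. This only frees the thread if the task
            // responds to interruption; a tight CPU loop will keep running.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            String fast = callWithTimeout(pool, () -> "parsed", 1000, "skipped");
            String slow = callWithTimeout(pool, () -> {
                Thread.sleep(60_000); // stands in for a parser that hangs
                return "parsed";
            }, 200, "skipped");
            System.out.println(fast + " / " + slow); // prints "parsed / skipped"
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caller gets control back after the timeout either way, but whether the worker thread itself is reclaimed depends on the hung code honouring the interrupt.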
> > 
> > 
> > -----Original Message-----
> > From:       [EMAIL PROTECTED] on behalf of Doug Cutting
> > Sent:       Thu 9/16/2004 2:47 PM
> > To: [EMAIL PROTECTED]
> > Cc: 
> > Subject:    Re: [Nutch-dev] fetcher continuously slowing down
> > Does the fetcher ever complete?  If not, some of the fetcher threads 
> > could be stuck.  There was a bug in the PDF parser which caused it to 
> > hang on some documents.  So, as you encounter more of such pages, your 
> > crawl would slow as more threads get stuck.
> > 
> > You might also try more threads.  I've noticed that it sometimes takes a
> > few minutes for the fetcher to "settle down", so that its initial
> > performance is not representative of the overall.
> > 
> > Doug
> > 
> > [EMAIL PROTECTED] wrote:
> > > Hello,
> > > 
> > > I am using Nutch 0.5 and am wondering whether anyone has noticed that
> > > the fetcher sometimes continuously slows down, from the moment it was
> > > started?
> > > 
> > > I am using 10 threads, and I noticed that the fetcher started with
> > > about 100KB/second, went up to 200KB/second, and then the crawling rate
> > > started continuously going down. After half a day it was crawling at a
> > > rate of 30KB/second.  The fetch list consists of a number of random
> > > hosts, so I don't think this should be caused by the delay between
> > > requests to the same host.  There was no other network traffic on my
> > > server.  Of course, there could be something external to my machine and
> > > network card, but I couldn't check that.
> > > 
> > > Has anyone seen this with Nutch?  Should I suspect Nutch, or something
> > > local to my installation or even external to my machine?
> > > 
> > > Thanks,
> > > Otis
> > > 
> > > 
> > > 
> > > -------------------------------------------------------
> > > This SF.Net email is sponsored by: thawte's Crypto Challenge Vl
> > > Crack the code and win a Sony DCRHC40 MiniDV Digital Handycam
> > > Camcorder. More prizes in the weekly Lunch Hour Challenge.
> > > Sign up NOW http://ad.doubleclick.net/clk;10740251;10262165;m
> > > _______________________________________________
> > > Nutch-developers mailing list
> > > [EMAIL PROTECTED]
> > > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> > 
> > 
> > -------------------------------------------------------
> > This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
> > Project Admins to receive an Apple iPod Mini FREE for your judgement on
> > who ports your project to Linux PPC the best. Sponsored by IBM.
> > Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
> > _______________________________________________
> > Nutch-developers mailing list
> > [EMAIL PROTECTED]
> > https://lists.sourceforge.net/lists/listinfo/nutch-developers
> > 
> 
> 



