Trunk?  Map reduce?  Could you describe your box
setup, job division, and maybe post your
conf/nutch-site.xml file?

I'm just trying to get things going and am not having much
luck with the mapreduce branch.  I also tried trunk, but the
crawl stops around 30000 pages (out of maybe a million),
and once it's done I can't get results to show up
via Tomcat.

Thanks,
Earl


--- Byron Miller <[EMAIL PROTECTED]> wrote:

> For what it's worth, I fetch my segments of 1 million
> URLs with 80 threads at a time and no slowdowns.
> 
> I'll grab some of my stats and publish them, but I
> haven't had problems with the fetcher slowing down like
> this in a long time.
> 
> (Linux/CentOS 4.2 platform)
> 
> -byron
> 
> --- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> 
> > Ken Krugler wrote:
> > 
> > >> I'm using the mapred branch on a FreeBSD 7.0 box to do
> > >> fetches of a 300k URL list.
> > >>
> > >> Initially, it's able to reach ~25 pages/s with 150 threads.
> > >> The fetcher gets progressively slower though, dropping down
> > >> to ~15 pages/s after 2-3 hours or so, and it continues to
> > >> slow down. I've seen a few references on these lists to the
> > >> issue, but I'm not clear on whether it's expected behaviour
> > >> or whether there's a solution to it. I've also noticed that
> > >> the process takes up more and more memory as it runs. Is
> > >> this expected as well?
> > >
> > >
> > > We've run into a similar situation, though we're using
> > > Nutch 0.7. What seems to be happening is that a host is
> > > slowly trickling data back to us. This happens when we're
> > > trying to release the connection, and we get stuck in the
> > > commons-httpclient code at
> > > ChunkedInputStream.exhaustInputStream().
> > >
> > > I have a theory that this happens when our HTTP protocol max
> > > size limit is hit. The protocol-httpclient plugin reads up to
> > > the limit (in our case, 1MB) and then tries to release the
> > > connection, but for some reason the host keeps sending us
> > > data, albeit at a very slow rate. I was seeing 30Kbits/second
> > > or so.
> > >
> > > Anyway, I've added the commons-httpclient code to my project
> > > and am plugging in some additional logging to help track down
> > > the issue.
> > 
> > 
> > I would appreciate any feedback. Please also note that you
> > need to eliminate other factors, like the limit of threads per
> > host, but most notably the overhead of parsing - please use the
> > -noParse flag to fetcher for all those experiments. In the past
> > it was common for the fetcher to be stuck in a buggy parser
> > plugin, so you will need to eliminate this factor.
> > 
> > -- 
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> > 
> > 
> > 
> 
> 



        
                