Thanks Julien,

I can confirm this patch works perfectly and does a good job of maintaining
a high crawl rate.

We have doubled the rate of information retrieval by using a time limit on
the fetch queue.
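
For anyone finding this thread later, here is a rough sketch of the two ideas
discussed below: a time limit on a fetch queue, plus dropping a queue's
remaining URLs after more than x successive failures. This is not the code
from NUTCH-769 / NUTCH-770. The class name, method names, and the choice to
pass the current time in as a parameter are all assumptions made for
illustration only.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a per-host fetch queue that gives up early.
 * Not the Nutch Fetcher and not the patch code; all names are invented.
 */
class BoundedFetchQueue {
    private final Deque<String> urls = new ArrayDeque<>();
    private final long deadlineMillis;     // absolute time after which the queue is emptied
    private final int maxSuccessiveErrors; // "x" in the description below
    private int successiveErrors = 0;

    BoundedFetchQueue(long deadlineMillis, int maxSuccessiveErrors) {
        this.deadlineMillis = deadlineMillis;
        this.maxSuccessiveErrors = maxSuccessiveErrors;
    }

    void add(String url) {
        urls.addLast(url);
    }

    /** Returns the next URL to fetch, or null if the queue is empty or timed out. */
    String next(long nowMillis) {
        if (nowMillis >= deadlineMillis) {
            urls.clear(); // time limit reached: abandon whatever is left
        }
        return urls.pollFirst();
    }

    /** A successful fetch resets the run of failures. */
    void reportSuccess() {
        successiveErrors = 0;
    }

    /** More than maxSuccessiveErrors failures in a row empties the queue. */
    void reportFailure() {
        successiveErrors++;
        if (successiveErrors > maxSuccessiveErrors) {
            urls.clear(); // host looks unresponsive: stop waiting on it
        }
    }

    int size() {
        return urls.size();
    }
}
```

A fetcher thread would call next(System.currentTimeMillis()) in its loop and
report each outcome back to the queue. Passing the clock in explicitly just
makes the behaviour easy to check without waiting for real time to pass.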

Thanks,
Eran

On Mon, Nov 23, 2009 at 1:28 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi guys,
>
> I've split the two features into separate patches on JIRA
> (NUTCH-769 / NUTCH-770).
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/11/21 Julien Nioche <lists.digitalpeb...@gmail.com>
>
> > Hi Eran,
> >
> > There is currently no time limit implemented in the Fetcher. We implemented
> > one which worked quite well in combination with another mechanism which
> > clears the URLs from a pool if more than x successive exceptions have been
> > encountered. This limits the impact of cases where a site or domain is not
> > responsive.
> >
> > I might try and submit a patch if I find the time next week. Our code has
> > been heavily modified with the previous patches which have not been
> > committed to the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658), so I'd need
> > to spend a bit of time extracting this specific functionality from the rest.
> >
> > Best,
> >
> > Julien
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
> >
> > 2009/11/21 Eran Zinman <zze...@gmail.com>
> >
> >> Hi,
> >>
> >> We've been using Nutch for focused crawling (right now we are crawling
> >> about 50 domains).
> >>
> >> We've encountered the long-tail problem - We've set TopN to 100,000 and
> >> generate.max.per.host to about 1500.
> >>
> >> 90% of all domains finish fetching after 30 minutes, and the other 10% take
> >> an additional 2.5 hours - making the slowest domain the bottleneck of the
> >> entire fetch process.
> >>
> >> I've read Ken Krugler's document, and he describes the same problem:
> >>
> >> http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
> >>
> >> I'm wondering - does anyone have a suggestion on what's the best way to
> >> tackle this issue?
> >>
> >> I think Ken suggested limiting the fetch time - for example, say
> >> "terminate after 1 hour, even if you are not done yet". Is that feature
> >> available in Nutch?
> >>
> >> I will be happy to try and contribute code if required!
> >>
> >> Thanks,
> >> Eran
> >>
> >
> >
>
