Nutch - Focused crawling

2009-11-21 Thread Eran Zinman
Hi,

We've been using Nutch for focused crawling (right now we are crawling about
50 domains).

We've encountered the long-tail problem. We've set topN to 100,000 and
generate.max.per.host to about 1500.

90% of all domains finish fetching within 30 minutes, and the other 10% take an
additional 2.5 hours - making the slowest domain the bottleneck of the
entire fetch process.

I've read Ken Krugler's document, and he describes the same problem:
http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/

I'm wondering - does anyone have a suggestion on what's the best way to
tackle this issue?

I think Ken suggested limiting the fetch time - for example, terminating
after 1 hour even if you are not done yet. Is that feature available in
Nutch?

I will be happy to try and contribute code if required!

Thanks,
Eran


Re: Nutch - Focused crawling

2009-11-21 Thread Julien Nioche
Hi Eran,

There is currently no time limit implemented in the Fetcher. We implemented
one which worked quite well in combination with another mechanism which
clears the URLs from a pool if more than x successive exceptions have been
encountered. This limits cases where a site or domain is not responsive.
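
A minimal standalone sketch of those two mechanisms - a wall-clock limit on the
whole fetch pass, plus clearing a host's queue after too many consecutive
exceptions - might look roughly like this (all class, field and method names
below are made up for illustration; this is not the actual Nutch Fetcher code
or the patch mentioned above):

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

public class TimeLimitedFetchSketch {

  static final long TIME_LIMIT_MS = 60L * 60 * 1000;  // e.g. stop fetching after one hour
  static final int MAX_EXCEPTIONS = 5;                 // e.g. drop a host after 5 consecutive failures

  public static void fetchAll(Map<String, Queue<String>> queuesPerHost) {
    long start = System.currentTimeMillis();
    Map<String, Integer> errors = new HashMap<String, Integer>();

    while (!queuesPerHost.isEmpty()) {
      // Time limit: once the budget is spent, abandon whatever is left.
      if (System.currentTimeMillis() - start > TIME_LIMIT_MS) {
        queuesPerHost.clear();
        break;
      }
      Iterator<Map.Entry<String, Queue<String>>> it = queuesPerHost.entrySet().iterator();
      while (it.hasNext()) {
        Map.Entry<String, Queue<String>> entry = it.next();
        String url = entry.getValue().poll();
        if (url == null) { it.remove(); continue; }
        try {
          fetch(url);                                  // placeholder for the real protocol call
          errors.put(entry.getKey(), 0);
        } catch (Exception e) {
          int n = errors.containsKey(entry.getKey()) ? errors.get(entry.getKey()) + 1 : 1;
          errors.put(entry.getKey(), n);
          // Exception threshold: an unresponsive host stops blocking everyone else.
          if (n >= MAX_EXCEPTIONS) {
            entry.getValue().clear();
            it.remove();
          }
        }
      }
    }
  }

  private static void fetch(String url) throws Exception { /* real fetching goes here */ }
}

A real implementation would also need to respect politeness delays and record
what was skipped, but the core idea is just the two checks above.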

I might try to submit a patch if I find the time next week. Our code has
been heavily modified by the previous patches which have not been
committed to the trunk yet (NUTCH-753 / NUTCH-719 / NUTCH-658), so I'd need
to spend a bit of time extracting this specific functionality from the rest.

Best,

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com



Re: Nutch upgrade to Hadoop

2009-11-21 Thread Dennis Kubes
I have created NUTCH-768.  I am in the middle of testing a few thousand 
page crawl for the most recent released version of Hadoop 0.20.1. 
Everything passes unit tests fine and there are no interface breaks. 
Looks like it will be an easy upgrade so far :)


Dennis

Andrzej Bialecki wrote:

Dennis Kubes wrote:
I would like to get a couple things in this release as well.  Let me 
know if you want help with the upgrade.


You mean you want to do the Hadoop upgrade? I won't stand in your way :)



Re: Nutch upgrade to Hadoop

2009-11-21 Thread Andrzej Bialecki

Dennis Kubes wrote:
I have created NUTCH-768.  I am in the middle of testing a few thousand 
page crawl for the most recent released version of Hadoop 0.20.1. 
Everything passes unit tests fine and there are no interface breaks. 
Looks like it will be an easy upgrade so far :)


Great, thanks!

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch upgrade to Hadoop

2009-11-21 Thread Dennis Kubes
Hadoop seems to have a few more configuration files now.  There are some 
questions about which ones to move over.  I think it might also be time 
to upgrade Xerces from 2.6 to the current 2.9.1; I am testing with that 
currently.  I know that with the current version some (ignorable 
but annoying) errors are thrown from configuration.


Dennis

Andrzej Bialecki wrote:

Dennis Kubes wrote:
I have created NUTCH-768.  I am in the middle of testing a few 
thousand page crawl for the most recent released version of Hadoop 
0.20.1. Everything passes unit tests fine and there are no interface 
breaks. Looks like it will be an easy upgrade so far :)


Great, thanks!



AbstractFetchSchedule

2009-11-21 Thread reinhard schwab
There is a piece of code I don't understand:

  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    // pages are never truly GONE - we have to check them from time to time.
    // pages with too long fetchInterval are adjusted so that they fit within
    // maximum fetchInterval (segment retention period).
    if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
      datum.setFetchInterval(maxInterval * 0.9f);
      datum.setFetchTime(curTime);
    }
    if (datum.getFetchTime() > curTime) {
      return false;   // not time yet
    }
    return true;
  }

Why is the fetch time set here to curTime?
And why is the fetch interval set to maxInterval * 0.9f without
checking the current value of fetchInterval?
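
To make the question concrete, here is a small standalone trace of that branch
with example numbers (CrawlDatum is simulated with plain variables; the values
are illustrative only):

public class ShouldFetchTrace {
  public static void main(String[] args) {
    long curTime = System.currentTimeMillis();           // "now", in milliseconds
    int maxInterval = 90 * 24 * 3600;                    // say 90 days, in seconds
    long fetchTime = curTime + 200L * 24 * 3600 * 1000;  // datum scheduled 200 days ahead
    float fetchInterval = 200 * 24 * 3600;               // whatever interval the datum holds

    // The first condition fires, because fetchTime lies more than maxInterval
    // seconds beyond curTime:
    if (fetchTime - curTime > (long) maxInterval * 1000) {
      fetchInterval = maxInterval * 0.9f;   // capped at 0.9 * maxInterval, whatever it was before
      fetchTime = curTime;                  // fetch time pulled back to "now"
    }
    // fetchTime now equals curTime, so the second check does not return false
    // and shouldFetch() answers true.
    System.out.println("fetchTime=" + fetchTime + ", fetchInterval=" + fetchInterval);
  }
}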

regards
reinhard