Oh - another big area of difference is that Nutch is not an incremental crawler.
Karl

From: Wright Karl (Nokia-S/Cambridge)
Sent: Thursday, June 10, 2010 2:00 PM
To: [email protected]
Subject: RE: nutch vs. LCF for web crawling

Hi Jack,

Nutch research sounds like a perfect project for you to tackle.

AFAIK, there are no missing LCF *features*, but of course there will be 
differences in (for example) how well the crawler recognizes and extracts links 
from content.  For instance, LCF does not extract links from anything other 
than HTML, XML, or plain-text documents.  I do not know Nutch's behavior here.

In my reading of Nutch, the big differences are architectural - Nutch is 
potentially distributed, running on Hadoop, and does not use an ACID database 
for its queue - and, as far as target audience is concerned, Nutch is more of 
a toolkit than an interactive, user-friendly crawler.  But that evaluation is 
based mainly on a relatively light and quick analysis of today's Nutch.

FWIW, as I said before, MetaCarta does a number of performance tests in-house, 
many of which include the RSS and Web connectors.  The emphasis of that testing 
is to be sure LCF is crawling as fast as the specified throttling parameters 
will allow.  You should not make the mistake of comparing raw throughput with 
throughput in a realistic throttling scenario.  Any attempt to crawl a given 
external site at the maximum rate the code allows will almost certainly get you 
cut off by that site's sysadmin in short order, so throttling is utterly 
essential in the real world, and the "realistic" maximum throughput is directly 
related to the number of individual domains you are trying to crawl.  At 
MetaCarta, one of our internal tests crawls some 10,000 domains - far more than 
most of our users will ever do - and the crawler still performs within 20% of 
maximum theoretical throughput.
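To make the throttling point concrete, here is a rough back-of-the-envelope model (my own sketch with hypothetical numbers - not LCF's actual throttling logic or default parameters): aggregate throughput is bounded first by the per-domain politeness limit times the number of domains, and only then by the crawler's own capacity.

```python
# Back-of-the-envelope model of throttled crawl throughput.
# All numbers below are hypothetical illustrations, not LCF defaults.

def max_throttled_throughput(num_domains: int,
                             fetches_per_domain_per_min: float,
                             crawler_capacity_per_min: float) -> float:
    """Aggregate fetch rate is bounded by per-domain politeness limits,
    then capped by what the crawler itself can sustain."""
    politeness_bound = num_domains * fetches_per_domain_per_min
    return min(politeness_bound, crawler_capacity_per_min)

# Crawling 10 sites politely (say, 12 fetches/min each) caps you at
# 120 docs/min no matter how fast the crawler code is:
print(max_throttled_throughput(10, 12, 10_000))      # 120
# With 10,000 domains, the crawler's own capacity becomes the limit:
print(max_throttled_throughput(10_000, 12, 10_000))  # 10000
```

This is why a single raw throughput number tells you little: with a handful of throttled sites, both LCF and Nutch would sit idle most of the time, and the difference only shows up at domain counts large enough to saturate the crawler.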

My larger point is that before you ask for metrics, you really need to think 
through the test cases you are interested in.  A single raw number is not going 
to help you here.

Karl


From: ext Jack Krupansky [mailto:[email protected]]
Sent: Thursday, June 10, 2010 1:41 PM
To: [email protected]
Subject: nutch vs. LCF for web crawling

It would be nice to have a brief summary comparison of the web crawling 
features of LCF relative to nutch. I personally don't know the details of nutch 
other than a quick read of the tutorial, but I am wondering whether there are 
any features of nutch web crawling that may not be available in the LCF web 
crawl connector.

A second question is whether Nutch has any performance or volume advantage over 
LCF for web crawling, in a general, rough sense - although some specific 
performance tests for LCF will eventually be good to have.

I would envision people using LCF to crawl desired web sites rather than the 
whole web, but the number of desired sites to be crawled could still be a 
moderately large number. At some point we should publish some guidelines as to 
what amount of web crawling LCF is targeted to support, in a general, rough 
sense.

(Answers could go in the LCF FAQ.)

Thanks.


-- Jack Krupansky
