Andrzej, before I dive into your specific questions... I want to step
back to the original topic: what applications are possible with Nutch?
The specialization that I focused on was a _listings_ crawler. There
are any number of listing types that one could potentially crawl:
- events (what I focused on)
- job postings
- news
- any category that you might find on either craigslist or ebay
One caveat about listings as a data-type: most traditional listing
aggregators (newspapers, etc) employ editors to acquire their
listings, usually via costly methods. They will not be happy if your
business model is based on scraping their content. (Hence Google
News' ongoing fights with the Associated Press.) If you build a
listings search startup, it's a good idea to get your listings
directly from the listing-creator, not a middleman.
Since I wanted to crawl HTML pages and index _listings_ (0..n per
HTML page), my system outline looks like this:
- customized Nutch 0.7 crawler
- custom segment reader, running feature-detector
- feedback into crawl_db
- crawl segments not used beyond this point.
- HTML+feature markup fed into an extraction pipeline
- individual event listings written to a listing_db (disk)
- synthetic "crawler" to traverse listing_db, creates new "segments"
with one record/listing
- Nutch indexer (w. custom plugins) creates Lucene index (sketched
below)
- custom servlets to return XML; PHP front-end turns XML results into
HTML
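To give a flavor of the indexing step: each listing record becomes one
Lucene Document. The field names below are invented, and I'm showing
the plain Lucene 2.x API rather than the actual Nutch 0.7
indexing-plugin interface, but the shape of the data is the same idea:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // One Lucene Document per listing record; field values are
  // hard-coded here just to show the shape.
  public class ListingIndexer {
    public static void main(String[] args) throws Exception {
      IndexWriter writer =
          new IndexWriter("/tmp/listing-index", new StandardAnalyzer(), true);

      Document doc = new Document();
      doc.add(new Field("title", "Jazz in the Park",
          Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("venue", "Prospect Park Bandshell",
          Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("startDate", "20071020",
          Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("sourceUrl", "http://example.com/events/1234",
          Field.Store.YES, Field.Index.UN_TOKENIZED));
      writer.addDocument(doc);

      writer.optimize();
      writer.close();
    }
  }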
That's a lot of custom stuff. It's a shame that any listing-oriented
startup would have to go through that whole process.
If Solr ever displaces the NutchBean/SearchServlet, that will
eliminate one step. The bean+servlet is nearly useless for any
startup, because you'll have to hack them beyond recognition to
implement a distinctive UI. And, as I mentioned before, creating a
distinctive product is essential to your startup's survival.
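For what it's worth, Solr's stock /select handler already gives you
the "servlet returns XML" piece of my pipeline for free. A rough
sketch of the call a front-end would make (host, port, and field
names are all made up):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.net.URLEncoder;

  // Query Solr's standard /select handler; results come back as XML,
  // ready for a PHP (or XSLT) front-end to render.
  public class SolrQueryExample {
    public static void main(String[] args) throws Exception {
      String q = URLEncoder.encode("category:events AND city:boston", "UTF-8");
      URL url = new URL("http://localhost:8983/solr/select?q=" + q + "&rows=10");
      BufferedReader in = new BufferedReader(
          new InputStreamReader(url.openStream(), "UTF-8"));
      for (String line; (line = in.readLine()) != null; ) {
        System.out.println(line);   // raw XML search results
      }
      in.close();
    }
  }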
I'm using Solr now on a different project and I love it. Wish I had
it two years ago. Now, about the crawler...
----( back to Andrzej's question )----
On Oct 16, 2007, at 1:10 PM, Andrzej Bialecki wrote:
> Matt Kangas wrote:
>> In this regard, I always found Nutch a bit painful to use. The
>> Nutch crawler is highly streamlined for straight-ahead Google-
>> scale crawling, but it's not modular enough to be considered a
>> "crawler construction toolkit". This is sad, because what you need
>> to "crawl differently" is just such a toolkit. Every search
>> startup must pick some unique crawling+ranking strategy, something
>> they think will dig up their audience's desired data as cheaply as
>> possible, and then implement it quickly.
> [...]
> In your opinion, what is missing in Nutch to support such use?
> Single-node operation, management UI, modularity, simplified
> ranking, xyz ?
Modularity, most definitely.
Consider this scenario: you want to crawl into CGIs, but not all CGIs
will have content of interest. Example: a site has an events calendar
and a message board. Crawling the message board is a huge waste of
bandwidth and CPU. Obviously, you'd like to avoid it. (The same holds
if you're crawling an auto-dealer's site for car listings, etc.)
If you are feature-detecting pages, one solution is to have a shallow
default depth-limit, then increase this limit when "interesting"
content is found. To cut off the crawl on dead-ends, you want
"updatedb" to pay attention to a (new) parse-status.
We implemented this. If I'm remembering this correctly... the problem
is, in Nutch 0.7, parse-status isn't something that can force the
termination of a crawl-path. Only fetch-status is considered.
Looking at Nutch 0.7.2 to refresh my memory, I see in
Fetcher.FetcherThread:
- "public run()" calls "private handleFetch()" once the content is
acquired
- if "Fetcher.parsing" is true and parse-status isSuccess, then
  "private outputPage()" is called with (FetcherOutput, Content,
  ParseText, ParseData)
"handleFetch()" is what I want to tweak, but it's private. So I have
to skip parsing here, and implement a custom segment-processor (and
updatedb) step.
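Very roughly, the workaround looks like the sketch below. All of the
types are invented stand-ins, not Nutch 0.7 classes; the point is just
the extra pass between fetching and updatedb.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Fetch with parsing turned off, then make a separate pass over the
  // fetched pages: parse, run the feature detector, and record a per-URL
  // verdict that a custom updatedb step can use to steer the crawl.
  public class SegmentPostProcessor {
    enum Verdict { LISTING_FOUND, DEAD_END, NEUTRAL }

    interface FeatureDetector {
      Verdict score(String html);   // "is there anything interesting here?"
    }

    /** One fetched page: URL plus raw HTML. */
    static class FetchedPage {
      final String url, html;
      FetchedPage(String url, String html) { this.url = url; this.html = html; }
    }

    static Map<String, Verdict> process(List<FetchedPage> pages,
                                        FeatureDetector detector) {
      Map<String, Verdict> verdicts = new HashMap<String, Verdict>();
      for (FetchedPage page : pages) {
        verdicts.put(page.url, detector.score(page.html));
      }
      return verdicts;   // fed back into the crawl db by a custom updatedb
    }
  }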
This is a decision-making step in the crawler that can't be easily
overridden by the user. It's an obstacle to "crawling differently".
What a startup needs, IMO, is a "crawler construction toolkit":
something that ships in a sane default configuration, but where
nearly any operation can be swapped out or overridden.
-------
Some ideas for making the crawler more flexible:
1) Rewrite Fetcher as a "framework", a la Spring, WebObjects, or Rails
(I think). Every conceivable step has its own, non-private method. To
customize its behavior, subclass and override the steps of interest.
- This is my strawman solution (see the sketch below). :) I know Doug
is strongly against using OO in this manner.
2) Make lots and lots of extension points. Fetcher.java thus becomes
nothing more than a sequence of extension-point calls.
- It could work, but... seems like a mess. Diagnosing config errors
is bad enough already, and this makes it worse.
3) Mimic Tomcat or Jetty's design, both of which are "toolkits for
web server construction".
- Config errors are hard to diagnose here, too. Need a tool to
sanity-check a crawler config at startup, or... ?
All things considered, I'd rather bake a crawler configuration into
a .java file than an .xml file. That way the compiler can (hopefully)
validate my crawl configuration, instead of leaving me to sift through
ExtensionPoint/Plugin exceptions at runtime.
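To make (1), and this last point, concrete: here is roughly the
flavor I have in mind. Every name below is invented, and the base
class is a throwaway stub, not a real API.

  // Stub "framework" base class: every step of the crawl is an
  // overridable method, and the crawl configuration is plain Java
  // that the compiler checks.
  abstract class CrawlerKit {
    protected String[] seeds = new String[0];
    protected int maxDepth = 3;

    // Default policy; subclasses override only the steps they care about.
    protected boolean shouldDescend(String url, int depth, boolean interesting) {
      return depth < maxDepth;
    }

    protected void fetchAndProcess(String url, int depth) {
      boolean interesting = false;   // (fetch/parse/feature-detect omitted)
      if (shouldDescend(url, depth, interesting)) {
        System.out.println("would follow outlinks below " + url);
      }
    }

    public void run() {
      for (String seed : seeds) fetchAndProcess(seed, 0);
    }
  }

  // The "configuration" is just a subclass -- a typo here is a compile
  // error, not an ExtensionPoint exception at runtime.
  public class EventCrawl extends CrawlerKit {
    protected boolean shouldDescend(String url, int depth, boolean interesting) {
      // shallow by default, deeper when the feature detector fires
      return depth < 2 || interesting;
    }

    public static void main(String[] args) {
      EventCrawl c = new EventCrawl();
      c.seeds = new String[] { "http://example.com/calendar/" };
      c.run();
    }
  }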
Ok, I've been typing for too long now. Time to pass the thread to the
next person. :)
--Matt
--
Matt Kangas / [EMAIL PROTECTED]