Andrzej, before I dive into your specific questions... I want to step back to the original topic: what applications are possible with Nutch?

The specialization that I focused on was a _listings_ crawler. There are any number of listing types that one could potentially crawl:

- events (what I focused on)
- job postings
- news
- any category that you might find on either craigslist or ebay

One caveat about listings as a data-type: most traditional listing aggregators (newspapers, etc.) employ editors to acquire their listings, usually via costly methods. They will not be happy if your business model is based on scraping their content. (Hence Google News' ongoing fights with the Associated Press.) If you build a listings search startup, it's a good idea to get your listings directly from the listing-creator, not a middleman.

Since I wanted to crawl HTML pages and index _listings_ (0..n per HTML page), my system outline looks like this:

- customized Nutch 0.7 crawler
  - custom segment reader, running feature-detector
  - feedback into crawl_db
  - crawl segments not used beyond this point.
- HTML+feature markup fed into an extraction pipeline
- individual event listings written to a listing_db (disk)
- synthetic "crawler" to traverse listing_db, creates new "segments" with one record/listing
- Nutch indexer (w. custom plugins) creates Lucene index
- custom servlets to return XML; PHP front-end turns XML results into HTML

That's a lot of custom stuff. Shame that any listing-oriented startup would have to go through that whole process.
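
To make the "feature-detector" and listing_db pieces above a bit more concrete, here's the flavor of thing I mean. This is purely an illustrative sketch (EventFeatureDetector, Listing, etc. are made-up names, not our production code and not any Nutch API):

// Hypothetical sketch only. A feature-detector inspects fetched HTML and
// reports whether the page contains listing-like content worth extracting
// (and worth crawling deeper from).
import java.util.regex.Pattern;

interface FeatureDetector {
    /** Return true if this page looks like it contains listings. */
    boolean hasFeatures(String url, String html);
}

/** Toy example: flag pages that look like event listings. */
class EventFeatureDetector implements FeatureDetector {
    private static final Pattern DATE =
        Pattern.compile("\\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\.? \\d{1,2}\\b");

    public boolean hasFeatures(String url, String html) {
        return DATE.matcher(html).find() && html.toLowerCase().contains("tickets");
    }
}

/** One extracted listing record, as it might be written to listing_db. */
class Listing {
    String sourceUrl;
    String title;
    String startDate;  // e.g. "2007-10-16"
    String venue;
}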

If Solr ever displaces the NutchBean/SearchServlet, that will eliminate one step. The bean+servlet combo is nearly useless for any startup, because you'll have to hack it beyond recognition to implement a distinctive UI. And, as I mentioned before, creating a distinctive product is essential to your startup's survival.
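
To illustrate: with Solr in place of the bean+servlet, the "return XML to the front-end" step collapses into a plain HTTP query against Solr's standard select handler. A minimal sketch (the localhost URL, field names, and query are placeholders, not my actual setup):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SolrQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed local Solr instance and schema; adjust q/fields for a real index.
        String q = URLEncoder.encode("title:concert AND city:nyc", "UTF-8");
        URL url = new URL("http://localhost:8983/solr/select?q=" + q + "&rows=10&wt=xml");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);  // raw Solr XML response; hand it to the front-end as-is
        }
        in.close();
    }
}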

I'm using Solr now on a different project and I love it. Wish I had it two years ago. Now, about the crawler...

----( back to Andrzej's question )----

On Oct 16, 2007, at 1:10 PM, Andrzej Bialecki wrote:
> Matt Kangas wrote:
>> In this regard, I always found Nutch a bit painful to use. The Nutch crawler is highly streamlined for straight-ahead Google-scale crawling, but it's not modular enough to be considered a "crawler construction toolkit". This is sad, because what you need to "crawl differently" is just such a toolkit. Every search startup must pick some unique crawling+ranking strategy, something they think will dig up their audience's desired data as cheaply as possible, and then implement it quickly.
>> [...]
>
> In your opinion, what is missing in Nutch to support such use? Single-node operation, management UI, modularity, simplified ranking, xyz?

Modularity, most definitely.

Consider this scenario: you want to crawl into CGIs, but not all CGIs will have content of interest. Example: a site has an events calendar and a message board. Crawling the message board is a huge waste of bandwidth and cpu. Obviously, you'd like to avoid it. (Same holds if you're crawling an auto-dealer's site for car listings, etc.)

If you are feature-detecting pages, one solution is to have a shallow default depth-limit, then increase this limit when "interesting" content is found. To cut off the crawl on dead-ends, you want "updatedb" to pay attention to a (new) parse-status.
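
Sketched in hypothetical code, the strategy looks something like this (a sketch of the idea only, with made-up class names; not our implementation and not Nutch's API):

// Depth-limited crawling with feedback from a feature-detector.
// All names here are hypothetical.
class PageStatus {
    static final int PARSE_OK = 0;
    static final int PARSE_DEAD_END = 1;

    int depth;                // link distance from the seed
    int parseStatus;          // set by the parser / feature-detector stage
    boolean featuresDetected; // did the feature-detector find interesting content?
}

class CrawlDecision {
    static final int DEFAULT_DEPTH_LIMIT = 2;      // shallow by default
    static final int INTERESTING_DEPTH_LIMIT = 6;  // dig deeper once features are found

    /** Called during the updatedb-equivalent step, once per fetched page. */
    boolean shouldExpandOutlinks(PageStatus page) {
        // A parse-level "dead end" should terminate the path even though the
        // fetch itself succeeded; this is the hook I wanted in Nutch 0.7.
        if (page.parseStatus == PageStatus.PARSE_DEAD_END) {
            return false;
        }
        int limit = page.featuresDetected ? INTERESTING_DEPTH_LIMIT : DEFAULT_DEPTH_LIMIT;
        return page.depth < limit;
    }
}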

We implemented this. If I'm remembering this correctly... the problem is, in Nutch 0.7, parse-status isn't something that can force the termination of a crawl-path. Only fetch-status is considered.

Looking at Nutch 0.7.2 to refresh my memory, I see in Fetcher.FetcherThread:

- "public run()" calls "private handleFetch()" once the content is acquired
- if "Fetcher.parsing" is true and parse-status isSuccess, then "private outputPage()" is called with (FetcherOutput, Content, ParseText, ParseData)

"handleFetch()" is what I want to tweak, but it's private. So I have to skip parsing here, and implement a custom segment-processor (and updatedb) step.

This is a decision-making step in the crawler that can't be easily overridden by the user. It's an obstacle to "crawling differently". What a startup needs, IMO, is a "crawler construction toolkit": something that ships in a sane default configuration, but where nearly any operation can be swapped out or overridden.

-------

Some ideas for making the crawler more flexible:

1) Rewrite Fetcher as a "framework", a la Spring, WebObjects, or Rails (I think). Every conceivable step has its own non-private method. To customize its behavior, subclass and override the steps of interest.
- This is my strawman solution. :) I know Doug is strongly against using OO in this manner.
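
To make option 1 concrete, here's the rough shape I have in mind: a strawman sketch with made-up class names, not a proposed patch against the actual Fetcher.

// Strawman: a "fetcher as framework" where every step is an overridable
// protected method. Hypothetical names throughout.
abstract class FetcherFramework {
    public final void fetchOne(String url) throws Exception {
        byte[] content = fetchContent(url);
        if (shouldParse(url, content)) {
            Object parse = parseContent(url, content);
            output(url, content, parse);
        } else {
            output(url, content, null);
        }
    }

    protected abstract byte[] fetchContent(String url) throws Exception;

    /** Default: parse everything. Subclasses override to skip junk. */
    protected boolean shouldParse(String url, byte[] content) { return true; }

    protected Object parseContent(String url, byte[] content) { return null; }

    protected void output(String url, byte[] content, Object parse) { }
}

/** A startup's customization: skip parsing for message-board URLs. */
class ListingsFetcher extends FetcherFramework {
    protected byte[] fetchContent(String url) throws Exception {
        return new byte[0];  // real HTTP fetch omitted for brevity
    }

    protected boolean shouldParse(String url, byte[] content) {
        return !url.contains("/forum/") && !url.contains("viewtopic");
    }
}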

2) Make lots and lots of extension points. Fetcher.java thus becomes nothing more than a sequence of extension-point calls.
- It could work, but... it seems like a mess. Diagnosing config errors is bad enough already, and this makes it worse.
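
For flavor, here's a toy sketch of why this worries me: every step gets looked up by name (ultimately from XML config), so a typo only surfaces as a runtime exception. All names here are hypothetical, not Nutch's actual plugin framework:

import java.util.HashMap;
import java.util.Map;

// Option 2 in miniature: the fetcher is just a sequence of named extension
// points, resolved at runtime.
interface FetchStep {
    void run(Map<String, Object> crawlState) throws Exception;
}

class ExtensionPointFetcher {
    private final Map<String, FetchStep> registry = new HashMap<String, FetchStep>();

    void register(String name, FetchStep step) { registry.put(name, step); }

    /** stepNames would come from an XML config file in this design. */
    void fetchOne(String url, String[] stepNames) throws Exception {
        Map<String, Object> state = new HashMap<String, Object>();
        state.put("url", url);
        for (String name : stepNames) {
            FetchStep step = registry.get(name);
            if (step == null) {
                // The failure mode: nothing catches a misspelled step name until runtime.
                throw new RuntimeException("No extension point named: " + name);
            }
            step.run(state);
        }
    }
}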

3) Mimic Tomcat or Jetty's design, both of which are "toolkits for web server construction".
- Config errors are hard to diagnose here, too. Need a tool to sanity-check a crawler config at startup, or... ?

All things considered, I'd rather bake a crawler configuration into a .java file than an .xml file. This way the compiler can (hopefully) validate my crawl configuration, instead of my sifting through ExtensionPoint/Plugin exceptions at runtime.
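
Something in this direction, say: a hypothetical builder-style API, shown only to illustrate the compile-time-checked flavor (none of this exists in Nutch):

// Hypothetical "crawl configuration as code": the compiler catches a
// misspelled step or a wrong argument type, instead of a plugin exception
// at runtime.
class Crawler {
    static final int STOP_PATH = 0;

    Crawler seeds(String path)        { return this; }
    Crawler depthLimit(int depth)     { return this; }
    Crawler featureDetector(Object d) { return this; }
    Crawler onDeadEnd(int policy)     { return this; }
    void run()                        { /* would kick off the crawl */ }
}

class CrawlConfigExample {
    public static void main(String[] args) {
        new Crawler()
            .seeds("urls/seeds.txt")
            .depthLimit(2)                  // shallow default
            .featureDetector(new Object())  // plug a detector in here
            .onDeadEnd(Crawler.STOP_PATH)
            .run();
    }
}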


Ok, I've been typing for too long now. Time to pass the thread to the next person. :)

--Matt

--
Matt Kangas / [EMAIL PROTECTED]

