Andrzej, before I dive into your specific questions... I want to step
back to the original topic: what applications are possible with Nutch?
The specialization that I focused on was a _listings_ crawler. There
are any number of listing types that one could potentially crawl:
- events (what I focused on)
- job postings
- news
- any category that you might find on either craigslist or ebay
One caveat about listings as a data-type: most traditional listing
aggregators (newspapers, etc) employ editors to acquire their
listings, usually via costly methods. They will not be happy if your
business model is based on scraping their content. (Hence Google
News' ongoing fights with the Associated Press.) If you build a
listings search startup, it's a good idea to get your listings
directly from the listing-creator, not a middleman.
Since I wanted to crawl HTML pages and index _listings_ (0..n per
HTML page), my system outline looks like this:
- customized Nutch 0.7 crawler
- custom segment reader, running feature-detector
- feedback into crawl_db
- crawl segments not used beyond this point.
- HTML+feature markup fed into an extraction pipeline
- individual event listings written to a listing_db (disk)
- synthetic "crawler" to traverse listing_db, creates new "segments"
with one record/listing
- Nutch indexer (w. custom plugins) creates Lucene index (sketched
below)
- custom servlets to return XML; PHP front-end turns XML results into
HTML
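To give a flavor of the indexing step: each listing record becomes one
Lucene Document. The field names below are invented, and I'm showing
the plain Lucene 2.x API rather than the actual Nutch 0.7
indexing-plugin interface, but the shape of the data is the same idea:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  // One Lucene Document per listing record; field values are
  // hard-coded here just to show the shape.
  public class ListingIndexer {
    public static void main(String[] args) throws Exception {
      IndexWriter writer =
          new IndexWriter("/tmp/listing-index", new StandardAnalyzer(), true);

      Document doc = new Document();
      doc.add(new Field("title", "Jazz in the Park",
          Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("venue", "Prospect Park Bandshell",
          Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("startDate", "20071020",
          Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("sourceUrl", "http://example.com/events/1234",
          Field.Store.YES, Field.Index.UN_TOKENIZED));
      writer.addDocument(doc);

      writer.optimize();
      writer.close();
    }
  }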
That's a lot of custom stuff. It's a shame that any listing-oriented
startup would have to go through that whole process.
If Solr ever displaces the NutchBean/SearchServlet, that will
eliminate one step. The bean+servlet is nearly useless for any
startup, because you'll have to hack them beyond recognition to
implement a distinctive UI. And, as I mentioned before, creating a
distinctive product is essential to your startup's survival.
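For what it's worth, Solr's stock /select handler already gives you
the "servlet returns XML" piece of my pipeline for free. A rough
sketch of the call a front-end would make (host, port, and field
names are all made up):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.URL;
  import java.net.URLEncoder;

  // Query Solr's standard /select handler; results come back as XML,
  // ready for a PHP (or XSLT) front-end to render.
  public class SolrQueryExample {
    public static void main(String[] args) throws Exception {
      String q = URLEncoder.encode("category:events AND city:boston", "UTF-8");
      URL url = new URL("http://localhost:8983/solr/select?q=" + q + "&rows=10");
      BufferedReader in = new BufferedReader(
          new InputStreamReader(url.openStream(), "UTF-8"));
      for (String line; (line = in.readLine()) != null; ) {
        System.out.println(line);   // raw XML search results
      }
      in.close();
    }
  }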
I'm using Solr now on a different project and I love it. Wish I had
it two years ago. Now, about the crawler...
----( back to Andrzej's question )----
On Oct 16, 2007, at 1:10 PM, Andrzej Bialecki wrote:
> Matt Kangas wrote:
>> In this regard, I always found Nutch a bit painful to use. The
>> Nutch crawler is highly streamlined for straight-ahead Google-
>> scale crawling, but it's not modular enough to be considered a
>> "crawler construction toolkit". This is sad, because what you need
>> to "crawl differently" is just such a toolkit. Every search
>> startup must pick some unique crawling+ranking strategy, something
>> they think will dig up their audience's desired data as cheaply as
>> possible, and then implement it quickly.
> [...]
> In your opinion, what is missing in Nutch to support such use?
> Single-node operation, management UI, modularity, simplified
> ranking, xyz ?
Modularity, most definitely.
Consider this scenario: you want to crawl into CGIs, but not all CGIs
will have content of interest. Example: a site has an events calendar
and a message board. Crawling the message board is a huge waste of
bandwidth and CPU. Obviously, you'd like to avoid it. (The same holds
if you're crawling an auto-dealer's site for car listings, etc.)
If you are feature-detecting pages, one solution is to have a shallow
default depth-limit, then increase this limit when "interesting"
content is found. To cut off the crawl on dead-ends, you want
"updatedb" to pay attention to a (new) parse-status.
We implemented this. If I'm remembering this correctly... the problem
is, in Nutch 0.7, parse-status isn't something that can force the
termination of a crawl-path. Only fetch-status is considered.
Looking at Nutch 0.7.2 to refresh my memory, I see in
Fetcher.FetcherThread:
- "public run()" calls "private handleFetch()" once the content is
acquired
- if "Fetcher.parsing" is true and parse-status isSuccess, then
  "private outputPage()" is called with (FetcherOutput, Content,
  ParseText, ParseData)
"handleFetch()" is what I want to tweak, but it's private. So I have
to skip parsing here, and implement a custom segment-processor (and
updatedb) step.
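Very roughly, the workaround looks like the sketch below. All of the
types are invented stand-ins, not Nutch 0.7 classes; the point is just
the extra pass between fetching and updatedb.

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Fetch with parsing turned off, then make a separate pass over the
  // fetched pages: parse, run the feature detector, and record a per-URL
  // verdict that a custom updatedb step can use to steer the crawl.
  public class SegmentPostProcessor {
    enum Verdict { LISTING_FOUND, DEAD_END, NEUTRAL }

    interface FeatureDetector {
      Verdict score(String html);   // "is there anything interesting here?"
    }

    /** One fetched page: URL plus raw HTML. */
    static class FetchedPage {
      final String url, html;
      FetchedPage(String url, String html) { this.url = url; this.html = html; }
    }

    static Map<String, Verdict> process(List<FetchedPage> pages,
                                        FeatureDetector detector) {
      Map<String, Verdict> verdicts = new HashMap<String, Verdict>();
      for (FetchedPage page : pages) {
        verdicts.put(page.url, detector.score(page.html));
      }
      return verdicts;   // fed back into the crawl db by a custom updatedb
    }
  }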
This is a decision-making step in the crawler that can't be easily
overridden by the user. It's an obstacle to "crawling differently".
What a startup needs, IMO, is a "crawler construction toolkit":
something that ships in a sane default configuration, but where
nearly any operation can be swapped out or overridden.
-------
Some ideas for making the crawler more flexible:
1) Rewrite Fetcher as a "framework", a la Spring, WebObjects, or Rails
(I think). Every conceivable step has its own, non-private method. To
customize its behavior, subclass and override the steps of interest.
- This is my strawman solution (see the sketch below). :) I know Doug
is strongly against using OO in this manner.
2) Make lots and lots of extension points. Fetcher.java thus becomes
nothing more than a sequence of extension-point calls.
- It could work, but... seems like a mess. Diagnosing config errors
is bad enough already, and this makes it worse.
3) Mimic Tomcat or Jetty's design, both of which are "toolkits for
web server construction".
- Config errors are hard to diagnose here, too. Need a tool to
sanity-check a crawler config at startup, or... ?
All things considered, I'd rather bake a crawler configuration into
a .java file than an .xml file. That way the compiler can (hopefully)
validate my crawl configuration, instead of leaving me to sift through
ExtensionPoint/Plugin exceptions at runtime.
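To make (1), and this last point, concrete: here is roughly the
flavor I have in mind. Every name below is invented, and the base
class is a throwaway stub, not a real API.

  // Stub "framework" base class: every step of the crawl is an
  // overridable method, and the crawl configuration is plain Java
  // that the compiler checks.
  abstract class CrawlerKit {
    protected String[] seeds = new String[0];
    protected int maxDepth = 3;

    // Default policy; subclasses override only the steps they care about.
    protected boolean shouldDescend(String url, int depth, boolean interesting) {
      return depth < maxDepth;
    }

    protected void fetchAndProcess(String url, int depth) {
      boolean interesting = false;   // (fetch/parse/feature-detect omitted)
      if (shouldDescend(url, depth, interesting)) {
        System.out.println("would follow outlinks below " + url);
      }
    }

    public void run() {
      for (String seed : seeds) fetchAndProcess(seed, 0);
    }
  }

  // The "configuration" is just a subclass -- a typo here is a compile
  // error, not an ExtensionPoint exception at runtime.
  public class EventCrawl extends CrawlerKit {
    protected boolean shouldDescend(String url, int depth, boolean interesting) {
      // shallow by default, deeper when the feature detector fires
      return depth < 2 || interesting;
    }

    public static void main(String[] args) {
      EventCrawl c = new EventCrawl();
      c.seeds = new String[] { "http://example.com/calendar/" };
      c.run();
    }
  }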
Ok, I've been typing for too long now. Time to pass the thread to the
next person. :)
--Matt
--
Matt Kangas / [EMAIL PROTECTED]