Matt Kangas wrote:
Hi Andrzej (and everyone else),
A few weeks ago, I intended to chime in on your "Scoring API issues"
thread, but this new thread is perhaps an even better place to speak up.
Time to stop lurking and contribute. :)
Thanks a lot for sharing your thoughts. Your post touches on a few
important issues ... I hope other lurkers on the lists will pipe in with
their feedback!
First, I want to echo Stefan Groschupf's comment from several months ago that
the Nutch community is really lucky to have someone like you still
working on critical issues like scoring. Without your knowledge and hard
work, Andrzej, Nutch development would grind to a halt, or at least be
limited to superficial changes (a new plugin every now and then, etc).
That's very kind of you, but a lot of the code has been contributed or
co-developed with others, or is based on input from the community.
Thankfully, Nutch is still a community effort ... ;)
I haven't been able to contribute as much recently as in the past, for
various reasons (next week I'm moving with my wife and two kids to another
city, and this has involved a lot of preparation ...). The situation should
improve around November, and I should be able to propose and implement
some serious changes in Nutch that I've been mulling over.
I started following this list in the Nutch 0.6 era. For one month in
2005, I considered jumping in to help with anything Doug wanted done,
but I quickly realized that Doug's goals and mine were at odds with each
other. Doug has always said he wanted to build an open-source competitor
to Google, and everything in Nutch has always been aligned with that
principle. I, on the other hand, wanted to build a search startup. A
head-on assault on a successful, established competitor is probably the
fastest way to kill any startup. The path to success is instead to zig
when they zag, innovate where they are not.
Crawling in the same manner as Google is probably a disaster for any
startup.
This is _very_ true. All wannabe SE operators should mark your words
well. I've personally participated in two such attempts, and both failed
miserably, mainly for business- and quality-related reasons. That's how
I know what kind of content you get from running an unconstrained crawl
... ;)
Every successful venture in this area (that I know of) had some kind of
strong focus: either on specific search functionality, or on an
information domain, or it combined search in a novel way with other content.
In this regard, I always found Nutch a bit painful to use. The Nutch
crawler is highly streamlined for straight-ahead Google-scale crawling,
but it's not modular enough to be considered a "crawler construction
toolkit". This is sad, because what you need to "crawl differently" is
just such a toolkit. Every search startup must pick some unique
crawling+ranking strategy, something they think will dig up their
audience's desired data as cheaply as possible, and then implement it
quickly.
[...]
In your opinion, what is missing in Nutch to support such use?
Single-node operation, management UI, modularity, simplified ranking, xyz?
BTW, regarding the ranking/scoring issues: I re-implemented the scoring
algorithm that we used in 0.7, based on in-degree and out-degree. There
are quite a few research papers claiming it's roughly equivalent to
PageRank (in the absence of link spamming ;) ), and it has one huge
advantage over the current OPIC: it's easy to ensure that scores are
stable for a given linkgraph, which is not the case with our OPIC-like
scoring (and our implementation of OPIC is not easy to fix). I'll submit
a JIRA issue with the patch shortly.
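To make the idea concrete, here's a rough sketch of the kind of scoring
function I mean (this is not the actual patch; the class, method names
and constants below are made up purely for illustration):

// Toy illustration of degree-based scoring: the score of a page is a
// deterministic function of its in-degree and out-degree in the link
// graph, so recomputing it over the same graph always yields the same
// value - unlike an OPIC-style running estimate.
public class DegreeScoring {

  /**
   * Compute a score from link-graph degrees. Log-damping keeps heavily
   * linked pages from dominating; the out-degree term mildly penalizes
   * pages that are little more than link lists.
   */
  public static float score(int inDegree, int outDegree) {
    float in = (float) Math.log1p(inDegree);    // reward incoming links
    float out = (float) Math.log1p(outDegree);  // damp link-heavy pages
    return in / (1.0f + 0.1f * out);
  }

  public static void main(String[] args) {
    // A page with 100 inlinks and 20 outlinks always gets the same
    // score for the same link graph.
    System.out.println(score(100, 20));
  }
}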
But there are still so many things missing, like a simple place to hang
a feature-detector, or a way to influence the direction of the crawl
based on features found. Or a depth-filter so you can crawl into
listings w/o infinite crawls. Etc.
Incidentally, I have developed both of these, for different customers.
The feature detector required adding an extension point (ParseFilter),
which takes Content and ParseData as arguments and puts some additional
metadata in them. It is a generalization of the HtmlParseFilter concept,
only it works for any type of content. We could also add a ContentFilter
to pre-process raw content before it's passed to the parsers.
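Roughly, the shape of such an extension point could look like this (a
hypothetical sketch, not the code I actually wrote; only Content and
ParseData are existing Nutch classes):

import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;

// Hypothetical sketch of a content-type-agnostic ParseFilter extension
// point, generalizing HtmlParseFilter: an implementation inspects the
// raw Content and the ParseData produced by the parser, and may attach
// extra metadata (detected features, classifications, etc.) for later
// crawl stages to use.
public interface ParseFilter {

  /** Extension point id, following the usual Nutch plugin convention. */
  String X_POINT_ID = ParseFilter.class.getName();

  /**
   * Examine the content and parse data; implementations would typically
   * add entries to the parse metadata rather than modify the content.
   */
  ParseData filter(Content content, ParseData parseData) throws Exception;
}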
The depth filter is easy to implement - perhaps we could add it to
out-of-the-box Nutch ... In my case it consisted of a ScoringFilter that
incremented a counter in CrawlDatum counting the number of hops from the
initial seed. For pages discovered via multiple paths or from more than
one seed, the minimum value was taken. All outlinks were still processed
and stored in crawldb/linkdb, but the generator would skip pages whose
counter was too high.
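In isolation the hop-counting logic is just a few lines; here is a
stripped-down sketch (the class and helper names are made up - in the
real version this lives in a ScoringFilter and the counter is kept in
the CrawlDatum metadata):

// Standalone sketch of the hop-counter logic behind a depth filter.
// In a real Nutch ScoringFilter the depth would be stored in the
// CrawlDatum metadata; here plain ints stand in for that.
public class DepthFilterSketch {

  /** Maximum number of hops from a seed that the generator will follow. */
  static final int MAX_DEPTH = 3;

  /** Depth assigned to an outlink: one hop further than its parent. */
  static int outlinkDepth(int parentDepth) {
    return parentDepth + 1;
  }

  /**
   * When a page is reached via several paths (or from several seeds),
   * keep the minimum depth seen so far.
   */
  static int mergeDepths(int existingDepth, int newDepth) {
    return Math.min(existingDepth, newDepth);
  }

  /**
   * The generator still sees every discovered URL (outlinks are stored
   * in crawldb/linkdb as usual) but skips those that are too deep.
   */
  static boolean shouldGenerate(int depth) {
    return depth <= MAX_DEPTH;
  }

  public static void main(String[] args) {
    int seed = 0;
    int child = outlinkDepth(seed);                    // depth 1
    int merged = mergeDepths(5, child);                // rediscovered closer: 1
    System.out.println(shouldGenerate(merged));        // true
    System.out.println(shouldGenerate(MAX_DEPTH + 1)); // false, skipped
  }
}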
What is good for search startups is also good for the Nutch community.
And what is good for search startups, IMO, is a flexible crawling
toolbox. +1 to any patch that helps turn the Nutch crawler into a more
flexible crawling toolkit.
Let's continue the discussion - it's important to decide upon a strategic
direction for Nutch, and feedback such as yours helps to set it so that
the project answers the common needs of the community instead of being a
purely academic exercise.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com