Matt Kangas wrote:
Hi Andrzej (and everyone else),

A few weeks ago, I intended to chime in on your "Scoring API issues" thread, but this new thread is perhaps an even better place to speak up. Time to stop lurking and contribute. :)

Thanks a lot for sharing your thoughts. Your post touches on a few important issues ... I hope other lurkers on the lists will chime in with their feedback!


First, I want to echo Stefan Groschupf's comment several months ago that the Nutch community is really lucky to have someone like you still working on critical issues like scoring. Without your knowledge and hard work, Andrzej, Nutch development would grind to a halt, or at least be limited to superficial changes (a new plugin every now and then, etc.).

That's very kind of you, but a lot of the code has been either contributed by or co-developed with others, or based on input from the community. Thankfully, Nutch is still a community effort .. ;)

I haven't been able to contribute as much recently as in the past, for various reasons (next week I'm moving with my wife and two kids to another city, and this has involved a lot of preparation ...) - the situation should improve around November, when I should be able to propose and implement some serious changes in Nutch that I've been mulling over.


I started following this list in the Nutch 0.6 era. For one month in 2005, I considered jumping in to help with anything Doug wanted done, but I quickly realized that Doug's goals and mine were at odds. Doug has always said he wanted to build an open-source competitor to Google, and everything in Nutch has always been aligned with that principle. I, on the other hand, wanted to build a search startup. A head-on assault on a successful, established competitor is probably the fastest way to kill any startup. The path to success is instead to zig when they zag, to innovate where they are not.

Crawling in the same manner as Google is probably a disaster for any startup.

This is _very_ true. All wannabe search-engine operators should mark your words well. I've personally participated in two such attempts, and both failed miserably - mainly for business- and quality-related reasons. That's how I know what kind of content you get from running an unconstrained crawl .. ;)

Every successful venture in this area (that I know of) had some kind of strong focus - either on specific search functionality, or on an information domain, or it combined search in a novel way with other content.


In this regard, I always found Nutch a bit painful to use. The Nutch crawler is highly streamlined for straight-ahead Google-scale crawling, but it's not modular enough to be considered a "crawler construction toolkit". This is sad, because what you need to "crawl differently" is just such a toolkit. Every search startup must pick some unique crawling+ranking strategy, something they think will dig up their audience's desired data as cheaply as possible, and then implement it quickly.
[...]

In your opinion, what is missing in Nutch to support such use? Single-node operation, management UI, modularity, simplified ranking, xyz?

BTW, regarding the ranking / scoring issues - I re-implemented the scoring algorithm that we used in 0.7, based on in-degree and out-degree. There are quite a few research papers claiming it's roughly equivalent to PageRank (in the absence of link spamming ;)), and it has one huge advantage over the current OPIC: it's easy to ensure that scores are stable for a given link graph, which is not the case with our OPIC-like scoring (and our implementation of OPIC is not easy to fix). I'll submit a JIRA issue with the patch shortly.
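
Roughly, the idea looks like this - a sketch to illustrate the principle only, not the actual patch, and the class and method names below are made up:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative degree-based scoring: each page receives credit from its
 * in-links, and each linking page splits its credit across its out-links.
 * The result depends only on the link graph, so it is stable across runs.
 */
public class DegreeScoring {

  /** graph maps a page URL to the list of URLs it links to. */
  public static Map<String, Float> score(Map<String, List<String>> graph) {
    Map<String, Float> scores = new HashMap<String, Float>();
    for (Map.Entry<String, List<String>> e : graph.entrySet()) {
      List<String> outlinks = e.getValue();
      if (outlinks == null || outlinks.isEmpty()) {
        continue;                                // page without outlinks contributes nothing
      }
      float credit = 1.0f / outlinks.size();     // split credit by out-degree
      for (String target : outlinks) {
        Float current = scores.get(target);
        scores.put(target, (current == null ? 0.0f : current) + credit);
      }
    }
    return scores;
  }
}

Because the score is a pure function of the link graph, re-running it over the same linkdb always produces the same numbers - exactly the stability that our OPIC-like scoring cannot guarantee.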

But there are still so many things missing, like a simple place to hang a feature-detector, or a way to influence the direction of the crawl based on features found. Or a depth-filter so you can crawl into listings w/o infinite crawls. Etc.

Incidentally, I have developed both of these, for different customers.

The feature detector required adding an extension point (ParseFilter), which takes Content and ParseData as arguments and puts some additional metadata in them - it is a generalization of the HtmlParseFilter concept, except that it works for any type of content. We could also add a ContentFilter, to pre-process raw content before it's passed to parsers.
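
To give an idea, the extension point could look roughly like this (a sketch only - the exact interface in my code may differ, and the parse-metadata detail in the comment is just an example):

import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;

/**
 * Hypothetical shape of the ParseFilter extension point described above.
 * Unlike HtmlParseFilter it is not tied to a DOM tree: it sees the raw
 * Content plus the ParseData produced by whichever parser handled it,
 * and may add metadata to either of them.
 */
public interface ParseFilter {

  /** ID of the extension point, following the usual plugin convention. */
  String X_POINT_ID = ParseFilter.class.getName();

  /**
   * Inspect the fetched content and the parse output, and record any
   * detected features as metadata (e.g. in the parse metadata).
   */
  void filter(Content content, ParseData parseData);
}

A feature-detector plugin would then implement this interface and register against the extension point in its plugin.xml, the same way HtmlParseFilter implementations do today.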

The depth filter is easy to implement - perhaps we could add it to Nutch out of the box ... In my case it consisted of a ScoringFilter that increased a counter in CrawlDatum counting the number of hops from the initial seed. In the case of pages discovered via multiple paths or from more than one seed, the minimum value was taken. All outlinks were still processed and stored in crawldb/linkdb, but the generator would skip pages where the counter was too high.
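
The logic boils down to something like this - a simplified sketch, not the actual ScoringFilter code; the class and method names are made up, and the real implementation keeps the counter in CrawlDatum metadata rather than in an in-memory map:

import java.util.HashMap;
import java.util.Map;

/**
 * Simplified sketch of the depth-limiting idea: seeds start at depth 0,
 * outlinks get the parent's depth + 1, pages reachable along several
 * paths keep the minimum depth, and the generate step skips anything
 * deeper than the configured limit.
 */
public class DepthLimiter {

  private final int maxDepth;
  private final Map<String, Integer> depths = new HashMap<String, Integer>();

  public DepthLimiter(int maxDepth) {
    this.maxDepth = maxDepth;
  }

  /** Seeds start at depth 0. */
  public void addSeed(String url) {
    depths.put(url, 0);
  }

  /** Record an outlink discovered on a parent page, keeping the minimum depth. */
  public void addOutlink(String parentUrl, String url) {
    Integer parentDepth = depths.get(parentUrl);
    if (parentDepth == null) return;            // unknown parent - nothing to record
    int candidate = parentDepth + 1;
    Integer current = depths.get(url);
    if (current == null || candidate < current) {
      depths.put(url, candidate);               // the link itself is stored either way
    }
  }

  /** The generator-side check: skip pages too many hops away from a seed. */
  public boolean shouldGenerate(String url) {
    Integer depth = depths.get(url);
    return depth != null && depth <= maxDepth;
  }
}

This keeps the link graph complete (nothing is filtered out of crawldb/linkdb), while the generator simply refuses to fetch pages beyond the depth limit - which is what keeps the crawl from running away into infinitely deep listings.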



What is good for search startups is also good for the Nutch community. And what is good for search startups, IMO, is a flexible crawling toolbox. +1 to any patch that helps turn the Nutch crawler into a more flexible crawling toolkit.

Let's continue this discussion - it's important to decide upon a strategic direction for Nutch, and feedback such as yours helps set it so that the project addresses the common needs of the community instead of being a purely academic exercise.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
