Hi Andrzej (and everyone else),
A few weeks ago, I intended to chime in on your "Scoring API issues"
thread, but this new thread is perhaps an even better place to speak
up. Time to stop lurking and contribute. :)
First, I want to echo Stefan Groschupf's comment from several months ago that
the Nutch community is really lucky to have someone like you still
working on critical issues like scoring. Without your knowledge and
hard work, Andrzej, Nutch development would grind to a halt, or at
least be limited to superficial changes (a new plugin every now and
then, etc).
I started following this list in the Nutch 0.6 era. For one month in
2005, I considered jumping in to help with anything Doug wanted done,
but I quickly realized that Doug's goals and mine were at odds with
each other. Doug has always said he wanted to build an open-source
competitor to Google, and everything in Nutch has always been aligned
with that principle. I, on the other hand, wanted to build a search
startup. A head-on assault on a successful, established competitor is
probably the fastest way to kill any startup. The path to success is
instead to zig when they zag, innovate where they are not.
Crawling in the same manner as Google is probably a disaster for any
startup. Whole-web crawling is quite tricky and expensive, and Google
has already done such a good job here that, even once your crawl
succeeds, how do you provide results that are noticeably better than
Google's?
Failure to differentiate your product is also a quick path to death
for a startup.
I can only conclude that the way to succeed as a search startup is to
CRAWL DIFFERENTLY. Focus on websites in specific regions, specific
topics, specific data types. Crawl into the corners of websites that
contain interesting nuggets of data (listings, calendars, etc) that
won't ever have a high PageRank. Find a data-niche with an audience
you understand, and hammer away.
Personally, I spent the last two years pursuing this strategy at
busytonight.com. We built an event-search engine using Nutch 0.7 that
crawled 30k websites in the USA, automatically discovered & extracted
~2.5M listings, and indexed ~1M unique listings. These were real-
world events that people could go to. Sadly, I cannot show you this
site, because we ran out of funds and were forced to shut the search-
driven site down.
I say this only to point out that I care about this space and think
there are fascinating opportunities in it. But, if you are a
startup, you have a finite time-until-death if you don't get a usable
product fully assembled.
In this regard, I always found Nutch a bit painful to use. The Nutch
crawler is highly streamlined for straight-ahead Google-scale
crawling, but it's not modular enough to be considered a "crawler
construction toolkit". This is sad, because what you need to "crawl
differently" is just such a toolkit. Every search startup must pick
some unique crawling+ranking strategy, something they think will dig
up their audience's desired data as cheaply as possible, and then
implement it quickly.
At BusyTonight, we integrated a feature-detector into the crawler
(date patterns, in our case), then added a site-whitelist filter and
a crawl-depth tracker so we could crawl into calendar CGIs but not
have an infinite crawl.
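To make the idea concrete, here is a rough, self-contained sketch of the kind of whitelist-plus-depth hook I have in mind. The class and constructor here are mine (hypothetical), not actual Nutch APIs, though the return-the-url-or-null convention mirrors Nutch's URLFilter.filter(String); a real implementation would track link-hop depth in crawl metadata rather than guessing it from path segments:

```java
import java.net.URI;
import java.util.Set;

// Hypothetical sketch of a site-whitelist + depth filter, in the spirit
// of a Nutch URLFilter plugin. Names are illustrative, not Nutch APIs.
public class WhitelistDepthFilter {
    private final Set<String> allowedHosts;
    private final int maxDepth;

    public WhitelistDepthFilter(Set<String> allowedHosts, int maxDepth) {
        this.allowedHosts = allowedHosts;
        this.maxDepth = maxDepth;
    }

    // Returns the URL unchanged if it passes, or null to reject it --
    // the same convention Nutch's URLFilter.filter(String) uses.
    public String filter(String url) {
        try {
            URI u = URI.create(url);
            if (u.getHost() == null || !allowedHosts.contains(u.getHost())) {
                return null; // not on the site whitelist
            }
            // Approximate "depth" by counting path segments; a real
            // crawler would carry link-hop depth in crawl metadata.
            String path = u.getPath() == null ? "" : u.getPath();
            long depth = path.chars().filter(c -> c == '/').count();
            return depth <= maxDepth ? url : null;
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: reject
        }
    }
}
```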
These are the kind of things that I think any content-focused search
startup would have to add themselves to Nutch. My particular
implementation wouldn't be much help to the average startup, but just
having some hooks available to plug this stuff in would make a world
of difference. (We had to patch Nutch 0.7 a lot more than I had hoped.)
Since I started with Nutch 0.7, several things have been added that
would have made my life easier, such as:
* crawl metadata (thank you Stefan)
* the scoring API (thank you Andrzej)
* the concept of multiple Parses per HTML page, introduced with the
RSS parsing plugin (thank you Dennis, I think?).
But there are still so many things missing, like a simple place to
hang a feature-detector, or a way to influence the direction of the
crawl based on features found. Or a depth-filter so you can crawl
into listings w/o infinite crawls. Etc.
Ultimately, I believe that what is good for startups is good for the
Nutch community overall. There isn't as much activity on this list as
I recall from the Nutch 0.7/0.8 era, and I think that's because some
people participating then were trying to build startups (like Stefan
and myself) and needed to get things done on a deadline.
If you bet on Nutch as your foundation but cannot build a
differentiated product quickly, you'll be screwed, and you will drop
out of the Nutch community and move on. Nutch will lose a possibly-
valuable contributor.
What is good for search startups is also good for the Nutch
community. And what is good for search startups, IMO, is a flexible
crawling toolbox. +1 to any patch that helps turn the Nutch crawler
into a more flexible crawling toolkit.
Sincerely,
--Matt Kangas
On Oct 15, 2007, at 6:00 AM, Andrzej Bialecki wrote:
Berlin Brown wrote:
Yea, you are right. You have to have a constrained set of domains to
search and to be honest, that works pretty well. The only thing, I
still get a lot of junk links. I would say that 30% are valid or
interesting links while the rest are kind of worthless. I guess it is
a matter of studying spam filters and removing that, but I have been
kind of lazy in doing so.
http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled
I have already built a site that I am describing, based on a short
list of popular domains using the very basic aspects of nutch. You
can search above and see what you think. I had about 100k links with
my last crawl.
There are quite a few companies (that I know of) who maintain indexes
of between 50 and 300 million pages. All of them implemented their
own strategy (specific to their needs) to solve this issue.
It's true that if you start crawling without any constraints, very
quickly (~20-30 full cycles) your crawldb will contain 90% of junk,
porn and spam. Some strategies to fight this are based on content
analysis (detection of porn-related content), url analysis
(presence of certain patterns in urls), and link analysis (analysis
of link neighborhoods). There are many research papers on these
subjects, and many of the strategies can be implemented as Nutch
plugins.
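For what it's worth, the URL-analysis strategy maps directly onto Nutch's existing urlfilter-regex plugin. A sketch of the kind of rules one might add to regex-urlfilter.txt (the keyword patterns are purely illustrative; the repeated-segment rule is adapted from Nutch's shipped defaults):

```
# reject urls whose path suggests junk/spam (illustrative patterns only)
-(?i)(casino|viagra|xxx)
# reject urls with a slash-delimited segment repeating 3+ times (crawl traps)
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept everything else
+.
```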
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
--
Matt Kangas / [EMAIL PROTECTED]