Hi Andrzej (and everyone else),

A few weeks ago, I intended to chime in on your "Scoring API issues" thread, but this new thread is perhaps an even better place to speak up. Time to stop lurking and contribute. :)

First, I want to echo Stefan Groschupf's comment from several months ago that the Nutch community is really lucky to have someone like you still working on critical issues like scoring. Without your knowledge and hard work, Andrzej, Nutch development would grind to a halt, or at least be limited to superficial changes (a new plugin every now and then, etc.).

I started following this list in the Nutch 0.6 era. For one month in 2005, I considered jumping in to help with anything Doug wanted done, but I quickly realized that Doug's goals and mine were at odds. Doug has always said he wanted to build an open-source competitor to Google, and everything in Nutch has always been aligned with that principle. I, on the other hand, wanted to build a search startup. A head-on assault on a successful, established competitor is probably the fastest way to kill any startup. The path to success is instead to zig when they zag, to innovate where they are not.

Crawling in the same manner as Google is probably a disaster for any startup. Whole-web crawling is tricky & expensive, and Google has already done such a good job here that, even once your crawl succeeds, how do you provide results that are noticeably better than Google's? Failure to differentiate your product is another quick path to death for a startup.

I can only conclude that the way to succeed as a search startup is to CRAWL DIFFERENTLY. Focus on websites in specific regions, specific topics, specific data types. Crawl into the corners of websites that contain interesting nuggets of data (listings, calendars, etc) that won't ever have a high PageRank. Find a data-niche with an audience you understand, and hammer away.

Personally, I spent the last two years pursuing this strategy at busytonight.com. We built an event-search engine using Nutch 0.7 that crawled 30k websites in the USA, automatically discovered & extracted ~2.5M listings, and indexed ~1M unique listings. These were real-world events that people could go to. Sadly, I cannot show you this site, because we ran out of funds and were forced to shut the search-driven site down.

I say this only to point out that I care about this space and think there are fascinating opportunities in it. But if you are a startup, you have a finite time-until-death if you don't get a usable product fully assembled.

In this regard, I always found Nutch a bit painful to use. The Nutch crawler is highly streamlined for straight-ahead Google-scale crawling, but it's not modular enough to be considered a "crawler construction toolkit". This is sad, because what you need to "crawl differently" is just such a toolkit. Every search startup must pick some unique crawling+ranking strategy, something they think will dig up their audience's desired data as cheaply as possible, and then implement it quickly.

At BusyTonight, we integrated a feature-detector into the crawler (date patterns, in our case), then added a site-whitelist filter and a crawl-depth tracker so we could crawl into calendar CGIs but not have an infinite crawl.
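
To make that concrete: of those three pieces, only the site-whitelist maps cleanly onto a stock extension point. Here's a minimal sketch of that piece against the URLFilter interface (the class name and the whitelist.file property are invented for illustration; this is not the code we actually shipped):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Hypothetical plugin class: keeps only URLs whose host appears in a
  // one-host-per-line whitelist file.
  public class WhitelistURLFilter implements URLFilter {

    private Configuration conf;
    private Set<String> allowedHosts = new HashSet<String>();

    // URLFilter contract: return the URL to keep it, null to drop it.
    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost().toLowerCase();
        return allowedHosts.contains(host) ? urlString : null;
      } catch (MalformedURLException e) {
        return null; // drop anything we cannot parse
      }
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
      // "whitelist.file" is an invented property name for this sketch
      String file = conf.get("whitelist.file", "whitelist.txt");
      try {
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
          line = line.trim();
          if (line.length() > 0 && !line.startsWith("#"))
            allowedHosts.add(line.toLowerCase());
        }
        in.close();
      } catch (IOException e) {
        throw new RuntimeException("cannot read whitelist " + file, e);
      }
    }

    public Configuration getConf() {
      return conf;
    }
  }

The depth tracker was the painful part: a URLFilter only ever sees the bare URL string, so anything that needs per-page state (like link depth) has to be threaded through the crawl cycle by hand, and that is exactly where our patching began.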

These are the kinds of things that I think any content-focused search startup would have to add to Nutch themselves. My particular implementation wouldn't be much help to the average startup, but just having some hooks available to plug this stuff in would make a world of difference. (We had to patch Nutch 0.7 a lot more than I had hoped.)

Since I started with Nutch 0.7, several things have been added that would have made my life easier, such as:
* crawl metadata (thank you Stefan; a quick example follows this list)
* the scoring API (thank you Andrzej)
* the concept of multiple Parses per fetched page, introduced with the RSS parsing plugin (thank you Dennis, I think?).
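
If you haven't used the first of those yet: each CrawlDatum can now carry arbitrary Writable key/value pairs through the whole crawl cycle. A hedged sketch, from memory of the 0.8/0.9-era API and with an invented key name:

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;

  // Hedged, from memory of the 0.8/0.9 API; "bt.source" is invented.
  public class MetadataExample {
    // Tag a CrawlDatum with app-specific metadata; the pair rides along
    // with the record for the rest of the crawl cycle.
    static void tag(CrawlDatum datum) {
      datum.getMetaData().put(new Text("bt.source"), new Text("calendar-cgi"));
    }

    // Read the tag back in a later phase (generate, update, index...).
    static String readBack(CrawlDatum datum) {
      Text source = (Text) datum.getMetaData().get(new Text("bt.source"));
      return source == null ? null : source.toString();
    }
  }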

But there are still so many things missing: a simple place to hang a feature-detector, a way to influence the direction of the crawl based on the features found, a depth-filter so you can crawl into listings without infinite crawls, and so on.
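
To be fair, the detector half is almost possible today: you can hang one off the HtmlParseFilter extension point and stash what it finds in the parse metadata. What's missing is the other half, a sanctioned way to feed those features back into generate/score decisions. A hedged sketch of the detector (signatures from memory of the 0.9 API; the metadata key is invented):

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  // Sketch only; interface signature quoted from memory of Nutch 0.9.
  public class DateFeatureDetector implements HtmlParseFilter {

    // Crude example: count "Oct 15, 2007"-style dates in the page text.
    private static final Pattern DATE = Pattern.compile(
        "\\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\\.?" +
        "\\s+\\d{1,2},\\s+\\d{4}\\b");

    private Configuration conf;

    public Parse filter(Content content, Parse parse,
                        HTMLMetaTags metaTags, DocumentFragment doc) {
      int hits = 0;
      Matcher m = DATE.matcher(parse.getText());
      while (m.find()) hits++;
      // Stash the feature count in the parse metadata ("feature.dates"
      // is invented), where a currently nonexistent scoring/generation
      // hook could read it back.
      parse.getData().getParseMeta().add("feature.dates",
                                         Integer.toString(hits));
      return parse;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

Registering that class as a parse-filter plugin is easy enough; the missing piece is everything downstream of the metadata write.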

Ultimately, I believe that what is good for startups is good for the Nutch community overall. There isn't as much activity on this list as I recall from the Nutch 0.7/0.8 era, and I think that's because some people participating then were trying to build startups (like Stefan and myself) and needed to get things done on a deadline.

If you bet on Nutch as your foundation but cannot build a differentiated product quickly, you'll be screwed, and you will drop out of the Nutch community and move on. Nutch will lose a possibly-valuable contributor.

What is good for search startups, then, is good for the Nutch community; and what search startups need, IMO, is a flexible crawling toolkit. +1 to any patch that helps turn the Nutch crawler into one.

Sincerely,
--Matt Kangas


On Oct 15, 2007, at 6:00 AM, Andrzej Bialecki wrote:

Berlin Brown wrote:
Yeah, you are right. You have to have a constrained set of domains to search, and to be honest, that works pretty well. The only thing is, I still get a lot of junk links: I would say that 30% are valid or interesting links while the rest are pretty much worthless. I guess it is a matter of studying spam filters and removing the junk, but I have been kind of lazy about doing so.

http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled

I have already built the kind of site I am describing, based on a short list of popular domains, using the very basic aspects of Nutch. You can search above and see what you think. I had about 100k links with my last crawl.

There are quite a few companies (that I know of) who maintain indexes of 50-300 million pages. All of them implemented their own strategy, specific to their needs, to solve this issue.

It's true that if you start crawling without any constraints, very quickly (~20-30 full cycles) your crawldb will contain 90% junk, porn and spam. Some strategies to fight this are based on content analysis (detection of porn-related content), URL analysis (presence of certain patterns in URLs), and link analysis (analysis of the link neighborhood). There are a lot of research papers on these subjects, and many of the strategies can be implemented as Nutch plugins.
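
For instance, the simplest kind of URL analysis requires no new code at all: the stock urlfilter-regex plugin reads ordered rules from regex-urlfilter.txt, first match wins, where '-' drops the URL and '+' keeps it. Illustrative patterns only:

  # drop URLs containing obvious spam keywords
  -(?i)(viagra|casino|xxx)
  # drop session-id traps that make the crawl frontier explode
  -(?i)[?&](sid|sessionid|jsessionid)=
  # keep everything else
  +.

Content analysis and link analysis need actual plugin code, but the principle is the same: keep the policy in a plugin, out of the core.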


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


--
Matt Kangas / [EMAIL PROTECTED]

