Hi Andrzej (and everyone else),
A few weeks ago, I intended to chime in on your "Scoring API issues"
thread, but this new thread is perhaps an even better place to speak
up. Time to stop lurking and contribute. :)
First, I want to echo Stefan Groschupf's comment from several months ago that
the Nutch community is really lucky to have someone like you still
working on critical issues like scoring. Without your knowledge and
hard work, Andrzej, Nutch development would grind to a halt, or at
least be limited to superficial changes (a new plugin every now and
then, etc).
I started following this list in the Nutch 0.6 era. For one month in
2005, I considered jumping in to help with anything Doug wanted done,
but I quickly realized that Doug's goals and mine were at odds with
each other. Doug has always said he wanted to build an open-source
competitor to Google, and everything in Nutch has always been aligned
with that principle. I, on the other hand, wanted to build a search
startup. A head-on assault on a successful, established competitor is
probably the fastest way to kill any startup. The path to success is
instead to zig when they zag, innovate where they are not.
Crawling in the same manner as Google is probably a disaster for any
startup. Whole-web crawling is quite tricky and expensive, and Google
has already done such a good job here that, even once your crawl
succeeds, how do you provide results that are noticeably better than
Google's?
Failure to differentiate your product is also a quick path to death
for a startup.
I can only conclude that the way to succeed as a search startup is to
CRAWL DIFFERENTLY. Focus on websites in specific regions, specific
topics, specific data types. Crawl into the corners of websites that
contain interesting nuggets of data (listings, calendars, etc) that
won't ever have a high PageRank. Find a data-niche with an audience
you understand, and hammer away.
Personally, I spent the last two years pursuing this strategy at
busytonight.com. We built an event-search engine using Nutch 0.7 that
crawled 30k websites in the USA, automatically discovered & extracted
~2.5M listings, and indexed ~1M unique listings. These were real-
world events that people could go to. Sadly, I cannot show you this
site, because we ran out of funds and were forced to shut the search-
driven site down.
I say this only to point out that I care about this space and think
there are fascinating opportunities in it. But, if you are a
startup, you have a finite time-until-death if you don't get a usable
product fully assembled.
In this regard, I always found Nutch a bit painful to use. The Nutch
crawler is highly streamlined for straight-ahead Google-scale
crawling, but it's not modular enough to be considered a "crawler
construction toolkit". This is sad, because what you need to "crawl
differently" is just such a toolkit. Every search startup must pick
some unique crawling+ranking strategy, something they think will dig
up their audience's desired data as cheaply as possible, and then
implement it quickly.
At BusyTonight, we integrated a feature-detector into the crawler
(date patterns, in our case), then added a site-whitelist filter and
a crawl-depth tracker so we could crawl into calendar CGIs but not
have an infinite crawl.
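To make the idea concrete, here is a rough, self-contained sketch of the kind of whitelist-plus-depth hook I have in mind. The class and constructor here are mine (hypothetical), not actual Nutch APIs, though the return-the-url-or-null convention mirrors Nutch's URLFilter.filter(String); a real implementation would track link-hop depth in crawl metadata rather than guessing it from path segments:

```java
import java.net.URI;
import java.util.Set;

// Hypothetical sketch of a site-whitelist + depth filter, in the spirit
// of a Nutch URLFilter plugin. Names are illustrative, not Nutch APIs.
public class WhitelistDepthFilter {
    private final Set<String> allowedHosts;
    private final int maxDepth;

    public WhitelistDepthFilter(Set<String> allowedHosts, int maxDepth) {
        this.allowedHosts = allowedHosts;
        this.maxDepth = maxDepth;
    }

    // Returns the URL unchanged if it passes, or null to reject it --
    // the same convention Nutch's URLFilter.filter(String) uses.
    public String filter(String url) {
        try {
            URI u = URI.create(url);
            if (u.getHost() == null || !allowedHosts.contains(u.getHost())) {
                return null; // not on the site whitelist
            }
            // Approximate "depth" by counting path segments; a real
            // crawler would carry link-hop depth in crawl metadata.
            String path = u.getPath() == null ? "" : u.getPath();
            long depth = path.chars().filter(c -> c == '/').count();
            return depth <= maxDepth ? url : null;
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: reject
        }
    }
}
```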
These are the kind of things that I think any content-focused search
startup would have to add themselves to Nutch. My particular
implementation wouldn't be much help to the average startup, but just
having some hooks available to plug this stuff in would make a world
of difference. (We had to patch Nutch 0.7 a lot more than I had hoped.)
Since I started with Nutch 0.7, several things have been added that
would have made my life easier, such as:
* crawl metadata (thank you Stefan)
* the scoring API (thank you Andrzej)
* the concept of multiple Parses per HTML page, introduced with the
RSS parsing plugin (thank you Dennis, I think?).
But there are still so many things missing, like a simple place to
hang a feature-detector, or a way to influence the direction of the
crawl based on features found. Or a depth-filter so you can crawl
into listings w/o infinite crawls. Etc.
Ultimately, I believe that what is good for startups is good for the
Nutch community overall. There isn't as much activity on this list as
I recall from the Nutch 0.7/0.8 era, and I think that's because some
people participating then were trying to build startups (like Stefan
and myself) and needed to get things done on a deadline.
If you bet on Nutch as your foundation but cannot build a
differentiated product quickly, you'll be screwed, and you will drop
out of the Nutch community and move on. Nutch will lose a possibly-
valuable contributor.
What is good for search startups is also good for the Nutch
community. And what is good for search startups, IMO, is a flexible
crawling toolbox. +1 to any patch that helps turn the Nutch crawler
into a more flexible crawling toolkit.
Sincerely,
--Matt Kangas
On Oct 15, 2007, at 6:00 AM, Andrzej Bialecki wrote:
Berlin Brown wrote:
Yea, you are right. You have to have a constrained set of domains to
search and to be honest, that works pretty well. The only thing, I
still get a lot of junk links. I would say that 30% are valid or
interesting links while the rest are kind of worthless. I guess it is
a matter of studying spam filters and removing that, but I have been
kind of lazy in doing so.
http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled
I have already built a site that I am describing, based on a short
list of popular domains using the very basic aspects of nutch. You
can search above and see what you think. I had about 100k links with
my last crawl.
There are quite a few companies (that I know of) who maintain indexes
of between 50 and 300 million pages. All of them implemented their
own strategy (specific to their needs) to solve this issue.
It's true that if you start crawling without any constraints, very
quickly (~20-30 full cycles) your crawldb will contain 90% of junk,
porn and spam. Some strategies to fight this are based on content
analysis (detection of porn-related content), url analysis
(presence of certain patterns in urls), and link analysis (analysis
of link neighborhoods). There are many research papers on these
subjects, and many of the strategies can be implemented as Nutch
plugins.
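For what it's worth, the URL-analysis strategy maps directly onto Nutch's existing urlfilter-regex plugin. A sketch of the kind of rules one might add to regex-urlfilter.txt (the keyword patterns are purely illustrative; the repeated-segment rule is adapted from Nutch's shipped defaults):

```
# reject urls whose path suggests junk/spam (illustrative patterns only)
-(?i)(casino|viagra|xxx)
# reject urls with a slash-delimited segment repeating 3+ times (crawl traps)
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept everything else
+.
```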
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
--
Matt Kangas / [EMAIL PROTECTED]