Matt Kangas wrote:
Hi Andrzej (and everyone else),
A few weeks ago, I intended to chime in on your "Scoring API issues"
thread, but this new thread is perhaps an even better place to speak up.
Time to stop lurking and contribute. :)
Thanks a lot for sharing your thoughts. Your post touches on a few
important issues ... I hope other lurkers on the lists will pipe in with
their feedback!
First, I want to echo Stefan Groschupf's comment from several months ago that
the Nutch community is really lucky to have someone like you still
working on critical issues like scoring. Without your knowledge and hard
work, Andrzej, Nutch development would grind to a halt, or at least be
limited to superficial changes (a new plugin every now and then, etc).
That's very kind of you, but a lot of the code has been contributed or
co-developed with others, or is based on input from the community.
Thankfully, Nutch is still a community effort ... ;)
I haven't been able to contribute as much recently as in the past, for
various reasons (next week I'm moving with my wife and two kids to another
city, and this has involved a lot of preparation ...). The situation should
improve around November, and I should be able to propose and implement
some serious changes in Nutch that I've been mulling over.
I started following this list in the Nutch 0.6 era. For one month in
2005, I considered jumping in to help with anything Doug wanted done,
but I quickly realized that Doug's goals and mine were at odds with each
other. Doug has always said he wanted to build an open-source competitor
to Google, and everything in Nutch has always been aligned with that
principle. I, on the other hand, wanted to build a search startup. A
head-on assault on a successful, established competitor is probably the
fastest way to kill any startup. The path to success is instead to zig
when they zag, innovate where they are not.
Crawling in the same manner as Google is probably a disaster for any
startup.
This is _very_ true. All wannabe SE operators should mark your words
well. I've personally participated in two such attempts, and both failed
miserably, mainly for business- and quality-related reasons. That's how
I know what kind of content you get from running an unconstrained crawl
... ;)
Every successful venture in this area (that I know of) had some kind of
strong focus: either on specific search functionality, or on an
information domain, or it combined search in a novel way with other content.
In this regard, I always found Nutch a bit painful to use. The Nutch
crawler is highly streamlined for straight-ahead Google-scale crawling,
but it's not modular enough to be considered a "crawler construction
toolkit". This is sad, because what you need to "crawl differently" is
just such a toolkit. Every search startup must pick some unique
crawling+ranking strategy, something they think will dig up their
audience's desired data as cheaply as possible, and then implement it
quickly.
[...]
In your opinion, what is missing in Nutch to support such use?
Single-node operation, management UI, modularity, simplified ranking, xyz?
BTW, regarding the ranking/scoring issues: I re-implemented the scoring
algorithm that we used in 0.7, based on in-degree and out-degree. There
are quite a few research papers claiming it's roughly equivalent to
PageRank (in the absence of link spamming ;) ), and it has one huge
advantage over the current OPIC: it's easy to ensure that scores are
stable for a given linkgraph, which is not the case with our OPIC-like
scoring (and our implementation of OPIC is not easy to fix). I'll submit
a JIRA issue with the patch shortly.
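To make the idea concrete, here's a rough sketch of the kind of scoring
function I mean (this is not the actual patch; the class, method names
and constants below are made up purely for illustration):

// Toy illustration of degree-based scoring: the score of a page is a
// deterministic function of its in-degree and out-degree in the link
// graph, so recomputing it over the same graph always yields the same
// value - unlike an OPIC-style running estimate.
public class DegreeScoring {

  /**
   * Compute a score from link-graph degrees. Log-damping keeps heavily
   * linked pages from dominating; the out-degree term mildly penalizes
   * pages that are little more than link lists.
   */
  public static float score(int inDegree, int outDegree) {
    float in = (float) Math.log1p(inDegree);    // reward incoming links
    float out = (float) Math.log1p(outDegree);  // damp link-heavy pages
    return in / (1.0f + 0.1f * out);
  }

  public static void main(String[] args) {
    // A page with 100 inlinks and 20 outlinks always gets the same
    // score for the same link graph.
    System.out.println(score(100, 20));
  }
}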
But there are still so many things missing, like a simple place to hang
a feature-detector, or a way to influence the direction of the crawl
based on features found. Or a depth-filter so you can crawl into
listings w/o infinite crawls. Etc.
Incidentally, I have developed both of these, for different customers.
The feature detector required adding an extension point (ParseFilter),
which takes Content and ParseData as arguments and puts some additional
metadata in them. It is a generalization of the HtmlParseFilter concept,
only it works for any type of content. We could also add a ContentFilter
to pre-process raw content before it's passed to the parsers.
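Roughly, the shape of such an extension point could look like this (a
hypothetical sketch, not the code I actually wrote; only Content and
ParseData are existing Nutch classes):

import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;

// Hypothetical sketch of a content-type-agnostic ParseFilter extension
// point, generalizing HtmlParseFilter: an implementation inspects the
// raw Content and the ParseData produced by the parser, and may attach
// extra metadata (detected features, classifications, etc.) for later
// crawl stages to use.
public interface ParseFilter {

  /** Extension point id, following the usual Nutch plugin convention. */
  String X_POINT_ID = ParseFilter.class.getName();

  /**
   * Examine the content and parse data; implementations would typically
   * add entries to the parse metadata rather than modify the content.
   */
  ParseData filter(Content content, ParseData parseData) throws Exception;
}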
The depth filter is easy to implement - perhaps we could add it to
out-of-the-box Nutch ... In my case it consisted of a ScoringFilter that
incremented a counter in CrawlDatum counting the number of hops from the
initial seed. For pages discovered via multiple paths or from more than
one seed, the minimum value was taken. All outlinks were still processed
and stored in crawldb/linkdb, but the generator would skip pages whose
counter was too high.
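In isolation the hop-counting logic is just a few lines; here is a
stripped-down sketch (the class and helper names are made up - in the
real version this lives in a ScoringFilter and the counter is kept in
the CrawlDatum metadata):

// Standalone sketch of the hop-counter logic behind a depth filter.
// In a real Nutch ScoringFilter the depth would be stored in the
// CrawlDatum metadata; here plain ints stand in for that.
public class DepthFilterSketch {

  /** Maximum number of hops from a seed that the generator will follow. */
  static final int MAX_DEPTH = 3;

  /** Depth assigned to an outlink: one hop further than its parent. */
  static int outlinkDepth(int parentDepth) {
    return parentDepth + 1;
  }

  /**
   * When a page is reached via several paths (or from several seeds),
   * keep the minimum depth seen so far.
   */
  static int mergeDepths(int existingDepth, int newDepth) {
    return Math.min(existingDepth, newDepth);
  }

  /**
   * The generator still sees every discovered URL (outlinks are stored
   * in crawldb/linkdb as usual) but skips those that are too deep.
   */
  static boolean shouldGenerate(int depth) {
    return depth <= MAX_DEPTH;
  }

  public static void main(String[] args) {
    int seed = 0;
    int child = outlinkDepth(seed);                    // depth 1
    int merged = mergeDepths(5, child);                // rediscovered closer: 1
    System.out.println(shouldGenerate(merged));        // true
    System.out.println(shouldGenerate(MAX_DEPTH + 1)); // false, skipped
  }
}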
What is good for search startups is also good for the Nutch community.
And what is good for search startups, IMO, is a flexible crawling
toolbox. +1 to any patch that helps turn the Nutch crawler into a more
flexible crawling toolkit.
Let's continue the discussion - it's important to decide upon a strategic
direction for Nutch, and feedback such as yours helps to set it so that
the project answers the common needs of the community instead of being a
purely academic exercise.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com