Dennis,
I am with you: I am building a large-scale www search engine, but I
might build a vertical search as well. Aren't the requirements for
building a large-scale www search the same as for building a vertical
www search? The only thing that seems to change is the scope.
I like the idea of making Nutch work with multiple types of crawlers
(maybe a crawler plugin kind of thing). I have looked at Droids and it
seems interesting.
Regarding the Solr integration, I am not sure that I agree with you on
that point, as I have considered using the Solr integration for my WWW
index. The main reason is that Solr seems to have stronger search
engine features at this point, like faceting, collapsing, synonyms,
spelling, etc., while Nutch clearly has crawling and processing large
amounts of data into an index down pat.
Regarding MapReduce, if it is good enough for Google, then it is
good enough for Nutch.
I think that if you segment Nutch into too many sub-projects, you lose
the flexibility or ability to have a single, solid, scalable
search engine.
Just my 2 cents.
-John
On Mar 13, 2009, at 6:19 PM, Dennis Kubes wrote:
With the release of Nutch 1.0, I think it is a good time to begin a
discussion about the future of Nutch. Here are some things to
consider, and I would love to hear everyone's views on them.
Nutch's original intention was as a large-scale www search engine.
That is a very specific goal. Only a few people and organizations
actually use it on that level. (I just happen to be one of them as
most of my work focuses on large scale web search as opposed to
vertical search). Many, perhaps most, people using Nutch these days
are either using parts of Nutch, such as the crawler, or are
targeting vertical or intranet-type search engines. This
can be seen in how many people have already started using the Solr
integration features. So while Nutch was originally intended as a
www search, IMO most people aren't using it for that purpose.
Since there are different purposes for different users, would it be
good to consider moving Nutch to a top-level Apache project, out from
under the Lucene umbrella? This would then allow the creation of
Nutch sub-projects, such as nutch-solr and nutch-hbase. Thoughts?
Many parts of Nutch have also been implemented in other projects.
For example, Tika for the parsers and Droids for the crawler. This begs
the question of what Nutch's core features are going forward. When I
think about search (again, my perspective is large scale), I think of
crawling or acquisition of data, parsing, analysis, indexing,
deployment, and searching. I personally think that there is much
room for improvement in crawling and especially analysis. Nutch
shouldn't just be about the shell but also the brains.
One of the biggest things I see is that many newcomers to Nutch have
a very hard time getting started. Part of this is understanding the
MapReduce mentality, part is documentation, and part is that some of us
have only so much time to answer questions, so some questions go
unanswered on the lists. How might this be improved going forward?
Any other thoughts are also welcome. Really, I want to start a
discussion about where everyone thinks we are with the state of
Nutch and its future.
Dennis