Re: The Future of Nutch

John Martyniak Fri, 13 Mar 2009 17:48:36 -0700

Dennis,

I am with you; I am building a large-scale www search engine, but I might also build a vertical search as well. Aren't the requirements the same for building a large-scale www search as for building a vertical www search? The only thing that seems to change is the scope.

I like the idea of making Nutch work with multiple types of crawlers (maybe a crawler plugin kind of thing). I have looked at Droids and it seems interesting.

Regarding the SOLR integration, I am not sure that I agree with you on that point, as I have considered using the SOLR integration for my WWW index. The main reason is that SOLR seems to have stronger search engine features at this point, like faceting, collapsing, synonyms, spelling, etc., but Nutch clearly has crawling and processing large amounts of data into an index down pat.

Regarding MapReduce, if it is good enough for Google, then it is good enough for Nutch.

I think that if you segment Nutch into too many sub-projects, you lose the flexibility or ability to have a single good, solid, scalable search engine.


Just my 2 cents.

-John


On Mar 13, 2009, at 6:19 PM, Dennis Kubes wrote:

With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some things to consider, and I would love to hear everyone's views on this.
Nutch's original intention was as a large-scale www search engine. That is a very specific goal. Only a few people and organizations actually use it on that level. (I just happen to be one of them, as most of my work focuses on large-scale web search as opposed to vertical search.) Many, perhaps most, people using Nutch these days are either using parts of Nutch, such as the crawler, or are targeting vertical or intranet-type search engines. This can be seen in how many people have already started using the Solr integration features. So while Nutch was originally intended as a www search, IMO most people aren't using it for that purpose.
Since there are different purposes for different users, would it be good to consider moving Nutch to a top-level Apache project out from under the Lucene umbrella? This would then allow the creation of Nutch sub-projects, such as nutch-solr, nutch-hbase. Thoughts?
Many parts of Nutch have also been implemented in other projects. For example, Tika for the parsers, Droids for the crawler. It begs the question of what Nutch's core features are going forward. When I think about search (again, my perspective is large scale), I think crawling or acquisition of data, parsing, analysis, indexing, deployment, and searching. I personally think that there is much room for improvement in crawling and especially analysis. Nutch shouldn't just be about the shell but also the brains.
And one of the biggest things I see is that many newcomers to Nutch have a very hard time getting started. Part of this is understanding the MapReduce mentality, part is documentation, and part is that there is only so much time some of us have to answer questions, so some questions go unanswered on the lists. How might this be improved going forward?
Any other thoughts are also welcome. Really, I want to start a discussion about where everyone thinks we are with the state of Nutch and its future.
Dennis
