+1 for making Nutch a top-level project. I would like to see more integrated usage with the other tools like Hadoop/HBase, and even support for other crawlers like Heritrix.

On Fri, Mar 13, 2009 at 8:19 PM, Dennis Kubes <ku...@apache.org> wrote:
> With the release of Nutch 1.0 I think it is a good time to begin a
> discussion about the future of Nutch. Here are some things to consider, and
> I would love to hear everyone's views on this.
>
> Nutch's original intention was as a large-scale www search engine. That is
> a very specific goal. Only a few people and organizations actually use it
> on that level. (I just happen to be one of them, as most of my work focuses
> on large-scale web search as opposed to vertical search.) Many, perhaps
> most, people using Nutch these days are either using parts of Nutch, such as
> the crawler, or are targeting vertical or intranet type search
> engines. This can be seen in how many people have already started using the
> Solr integration features. So while Nutch was originally intended as a www
> search, IMO most people aren't using it for that purpose.
>
> Since there are different purposes for different users, would it be good to
> consider moving Nutch to a top-level Apache project, out from under the
> Lucene umbrella? This would then allow the creation of Nutch sub-projects,
> such as nutch-solr, nutch-hbase. Thoughts?
>
> Many parts of Nutch have also been implemented in other projects. For
> example, Tika for the parsers, Droids for the crawler. It begs the question
> of what Nutch's core features are going forward. When I think about search
> (again, my perspective is large scale), I think crawling or acquisition of
> data, parsing, analysis, indexing, deployment, and searching. I personally
> think that there is much room for improvement in crawling and especially
> analysis. Nutch shouldn't just be about the shell but also the brains.
>
> And one of the biggest things I see is that many newcomers to Nutch have a
> very hard time getting started. Part of this is understanding the MapReduce
> mentality, part is documentation, and part is that there is only so much
> time some of us have to answer questions, so some questions go unanswered on
> the lists. How might this be improved going forward?
>
> Any other thoughts are also welcome. Really I want to start a discussion
> about where everyone thinks we are with the state of Nutch and its future.
>
> Dennis