+1 for making Nutch top level.  I would like to see more integrated usage
with the other tools like Hadoop/HBase, and even support for other
crawlers like Heritrix.



On Fri, Mar 13, 2009 at 8:19 PM, Dennis Kubes <ku...@apache.org> wrote:

> With the release of Nutch 1.0 I think it is a good time to begin a
> discussion about the future of Nutch.  Here are some things to consider; I
> would love to hear everyone's views on them.
>
> Nutch's original intention was as a large-scale www search engine.  That is
> a very specific goal.  Only a few people and organizations actually use it
> on that level.  (I just happen to be one of them as most of my work focuses
> on large scale web search as opposed to vertical search). Many, perhaps
> most, people using Nutch these days are either using parts of Nutch, such as
> the crawler, or are targeting vertical or intranet-type search
> engines.  This can be seen in how many people have already started using the
> Solr integration features.  So while Nutch was originally intended as a www
> search, IMO most people aren't using it for that purpose.
>
> Since there are different purposes for different users, would it be good to
> consider moving Nutch to a top-level Apache project, out from under the
> Lucene umbrella?  This would then allow the creation of Nutch sub-projects,
> such as nutch-solr, nutch-hbase.  Thoughts?
>
> Many parts of Nutch have also been implemented in other projects.  For
> example, Tika for the parsers, Droids for the crawler.  It begs the question:
> what are Nutch's core features going forward?  When I think about search
> (again my perspective is large scale), I think crawling or acquisition of
> data, parsing, analysis, indexing, deployment, and searching.  I personally
> think that there is much room for improvement in crawling and especially
> analysis.  Nutch shouldn't just be about the shell but also the brains.
>
> And one of the biggest things I see is that many newcomers to Nutch have a very
> hard time getting started.  Part of this is understanding the MapReduce
> mentality, part is documentation, and part is that some of us have only so much
> time to answer questions, so some questions go unanswered on the lists.
> How might this be improved going forward?
>
> Any other thoughts also welcome.  Really I want to start a discussion about
> where everyone thinks we are with the state of Nutch and its future.
>
> Dennis
>
>
