Re: The Future of Nutch

Tony Wang Mon, 16 Mar 2009 17:54:30 -0700

I just wish there were some clear, publicly available documentation for
Nutch/Solr integration. Or are some developers already working on this?
- Tony


On Mon, Mar 16, 2009 at 6:50 PM, Otis Gospodnetic <ogjunk-nu...@yahoo.com> wrote:

>
> Hello,
>
>
> Comments inlined.
>
> ----- Original Message ----
> > From: Dennis Kubes <ku...@apache.org>
> > To: nutch-user@lucene.apache.org
> > Sent: Friday, March 13, 2009 8:19:37 PM
> >
> > With the release of Nutch 1.0 I think it is a good time to begin a
> > discussion about the future of Nutch.  Here are some things to consider,
> > and I would love to hear everyone's views on this.
> >
> > Nutch's original intention was as a large-scale www search engine.  That
> > is a very specific goal.  Only a few people and organizations actually use
> > it on that level.  (I just happen to be one of them as most of my work
> > focuses on large scale web search as opposed to vertical search).
>
> Yes, there are fewer parties doing large scale web crawling.  Still, as
> there is no alternative fetcher+parser+indexer+searcher capable of handling
> large scale deployments like Nutch (or maybe Heritrix has the same scaling
> capabilities?), I think Nutch's ability to perform web-wide crawls, etc.
> should be preserved.
>
> > Many, perhaps most, people using Nutch these days are either using parts
> > of Nutch, such as the crawler, or are targeting vertical or intranet-type
> > search engines.  This can be seen in how many people have already started
> > using the Solr integration features.  So while Nutch was originally
> > intended as a www search, IMO most people aren't using it for that purpose.
>
>
> That's my experience, too.  I think we can have both under the same Nutch
> roof.
>
> > Since there are different purposes for different users, would it be good
> > to consider moving Nutch to a top-level Apache project, out from under the
> > Lucene umbrella?  This would then allow the creation of Nutch sub-projects,
> > such as nutch-solr and nutch-hbase.  Thoughts?
>
>
> I disagree, at least in the near term.  There is nothing preventing those
> sub-projects from existing under Nutch today.  Both Solr and Lucene have the
> contrib area where similar sub-projects live.  I think it's not a matter of
> being a TLP, but rather attracting enough developer interest, then user
> interest, and then contributor interest, so that these sub-projects can be
> created, maintained, advanced.  Right now, Solr gets a TON of attention, as
> does Lucene.  Nutch gets the least developer attention, and for some reason
> the nutch-user subscribers "feel" a bit different from solr-user or
> java-user subscribers.
>
> > Many parts of Nutch have also been implemented in other projects: for
> > example, Tika for the parsers and Droids for the crawler.  It begs the
> > question of what Nutch's core features are going forward.  When I think
> > about search (again, my perspective is large scale), I think of crawling or
> > acquisition of data, parsing, analysis, indexing, deployment, and
> > searching.  I personally think that there is much room for improvement in
> > crawling and especially analysis.  Nutch shouldn't just be about the shell
> > but also the brains.
>
>
> My feeling has long been that indexing and searching should be outsourced
> to Solr, parsing to Tika, and that the fetcher should probably be replaced
> with Droids.  I say probably because I'm not very familiar with Droids just
> yet.  Nutch should, I think, then be an application built with all those
> components combined (is that what you mean by the shell?), and then apply
> its knowledge of either web-wide scale trickery, or vertical SE trickery, or
> ...  I think that's where the brains are needed, to tie it all together,
> while still making certain pieces swappable and more easily digestible by
> potential new contributors and developers, as well as users.  I know plugins
> do some of that already, but it seems like there might still be more in the
> core than there should/could be...
>
> > And one of the biggest things I see is that many newcomers to Nutch have a
> > very hard time getting started.  Part of this is understanding the
> > MapReduce mentality, part is documentation, and part is that there is only
> > so much time some of us have to answer questions, so some questions go
> > unanswered on the lists.  How might this be improved going forward?
>
> I am not 100% sure, but I think it's a bit of all of the above.  Lucene has
> been around for 10 years and from day one had people answer questions from
> the most basic ones to the trickiest ones.  It's the same with Solr today.
> Nutch has the least active and the smallest developer base, so questions
> don't get answered.  Again, people on this list also tend to have a
> different "style" of asking questions - no hellos, no thank yous, and so on,
> which doesn't help.
>
>
> I think the existence of a book on Lucene helped Lucene, but Solr doesn't
> have a book so far, yet it still has a healthy developer and user community.  I
> think that's because Solr is simply more needed by more people than Nutch
> is.
>
> > Any other thoughts are also welcome.  Really, I want to start a discussion
> > about where everyone thinks we are with the state of Nutch and its future.
>
>
> I think it's good you started this discussion.  My opinion about what needs
> to be done with Nutch is above.  I think it needs to stay with Hadoop.  I
> think it should remain under Lucene for now.  Once and iff it develops those
> sub-projects and we all feel it's better for it to be TLP, then I think we
> can bring this up again.
>
> Otis
>
>


-- 
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信
