Re: The Future of Nutch

Otis Gospodnetic Mon, 16 Mar 2009 18:00:38 -0700

Hi Tony,

You keep asking about this... :)
Please try to study the sources and please help us improve Nutch and its 
documentation by publishing what you've learned on the Nutch Wiki.  Everyone 
can add and edit pages on that Wiki.  You can't force people who volunteer 
their time into helping you by asking your question literally a dozen times! :) 
 If your question is not getting answered, perhaps people don't know or don't 
have time to help or don't have time or incentive to go learn it and help you.  
Or perhaps try saying what you've done and what errors you got, and so on, to 
make it easier for others to (try to) help you.



Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Tony Wang <ivyt...@gmail.com>
> To: nutch-user@lucene.apache.org
> Sent: Monday, March 16, 2009 8:53:57 PM
> Subject: Re: The Future of Nutch
> 
> I just wish there could be some clear documentation for Nutch/Solr
> integration publicly available. Or some developers are already working on
> this?
> - Tony
> 
> On Mon, Mar 16, 2009 at 6:50 PM, Otis Gospodnetic wrote:
> 
> >
> > Hello,
> >
> >
> > Comments inlined.
> >
> > ----- Original Message ----
> > > From: Dennis Kubes 
> > > To: nutch-user@lucene.apache.org
> > > Sent: Friday, March 13, 2009 8:19:37 PM
> > >
> > > With the release of Nutch 1.0 I think it is a good time to begin a
> > discussion
> > > about the future of Nutch.  Here are some things to consider, and I would
> > love to
> > > hear everyone's views on this.
> > >
> > > Nutch's original intention was as a large-scale www search engine.  That
> > is a
> > > very specific goal.  Only a few people and organizations actually use it
> > on that
> > > level.  (I just happen to be one of them as most of my work focuses on
> > large
> > > scale web search as opposed to vertical search).
> >
> > Yes, there are fewer parties doing large scale web crawling.  Still, as
> > there is no alternative fetcher+parser+indexer+searcher capable of handling
> > large scale deployments like Nutch (or maybe Heritrix has the same scaling
> > capabilities?), I think Nutch's ability to perform web-wide crawls, etc.
> > should be preserved.
> >
> > > Many, perhaps most, people
> > > using Nutch these days are either using parts of Nutch, such as the
> > crawler, or
> > > are targeting vertical or intranet-type search engines.  This can
> > be
> > > seen in how many people have already started using the Solr integration
> > > features.  So while Nutch was originally intended as a www search, IMO
> > most
> > > people aren't using it for that purpose.
> >
> >
> > That's my experience, too.  I think we can have both under the same Nutch
> > roof.
> >
> > > Since there are different purposes for different users, would it be good
> > to
> > > consider moving Nutch to a top level apache project out from under the
> > Lucene
> > > umbrella?  This would then allow the creation of nutch sub-projects, such
> > as
> > > nutch-solr, nutch-hbase.  Thoughts?
> >
> >
> > I disagree, at least in the near term.  There is nothing preventing those
> > sub-projects existing under Nutch today.  Both Solr and Lucene have the
> > contrib area where similar sub-projects live.  I think it's not a matter of
> > being a TLP, but rather attracting enough developer interest, then user
> > interest, and then contributor interest, so that these sub-projects can be
> > created, maintained, advanced.  Right now, Solr gets a TON of attention, as
> > does Lucene.  Nutch gets the least developer attention, and for some reason
> > the nutch-user subscribers "feel" a bit different from solr-user or
> > java-user subscribers.
> >
> > > Many parts of Nutch have also been implemented in other projects.  For
> > example,
> > > Tika for the parsers, Droids for the Crawler.  It begs the question what
> > are
> > > Nutch's core features going forward.  When I think about search (again my
> > > perspective is large scale), I think crawling or acquisition of data,
> > parsing,
> > > analysis, indexing, deployment, and searching.  I personally think that
> > there is
> > > much room for improvement in crawling and especially analysis.  Nutch
> > shouldn't
> > > just be about the shell but also the brains.
> >
> >
> > My feeling has long been that indexing and searching should be outsourced
> > to Solr, parsing to Tika, and that the fetcher should probably be replaced
> > with Droids.  I say probably because I'm not very familiar with Droids just
> > yet.  Nutch should, I think, then be an application built with all those
> > components combined (is that what you mean by the shell?), and then apply
> > its knowledge of either web-wide scale trickery, or vertical SE trickery, or
> > ...  I think that's where the brains are needed, to tie it all together,
> > while still making certain pieces swappable and more easily digestible by
> > potential new contributors and developers, as well as users.  I know plugins
> > do some of that already, but it seems like there might still be more in the
> > core than there should/could be...
> >
> > > And one of the biggest things I see is many newcomers to nutch have a
> > very hard
> > > time getting started.  Part of this is understanding mapreduce mentality,
> > part
> > > is documentation, part is there is only so much time some of us have to
> > answer
> > > questions so some questions go unanswered on the lists.  How might this
> > be
> > > improved going forward?
> >
> > I am not 100% sure, but I think it's a bit of all of the above.  Lucene has
> > been around for 10 years and from day one had people answer questions from
> > the most basic ones to the trickiest ones.  It's the same with Solr today.
> >  Nutch has the least active and the smallest developer base, so questions
> > don't get answered.  Again, people on this list also tend to have a
> > different "style" of asking questions - no hellos, no thank yous, and so on,
> > which doesn't help.
> >
> >
> > I think the existence of a book on Lucene helped Lucene, but Solr doesn't
> > yet have a book, yet it still has a healthy developer and user community.  I
> > think that's because Solr is simply more needed by more people than Nutch
> > is.
> >
> > > Any other thoughts also welcome.  Really I want to start a discussion
> > about where everyone thinks we are with the state of Nutch and its future.
> >
> >
> > I think it's good you started this discussion.  My opinion about what needs
> > to be done with Nutch is above.  I think it needs to stay with Hadoop.  I
> > think it should remain under Lucene for now.  Once and iff it develops those
> > sub-projects and we all feel it's better for it to be TLP, then I think we
> > can bring this up again.
> >
> > Otis
> >
> >
> 
> 
> -- 
> Are you RCholic? www.RCholic.com
> 温 良 恭 俭 让 仁 义 礼 智 信
