Re: The Future of Nutch

Dennis Kubes Tue, 17 Mar 2009 18:22:24 -0700

Marc,

Glad you responded.  Always good to hear people's thoughts.


Marc Boucher wrote:

Dennis, Otis et al,

My very small team has kept silent for a long time. We've been playing
with Nutch, Hadoop and to a lesser extent Solr for about 2 years now.
Before I get into my thoughts on what direction things should take I
would like to offer a thought on why Nutch is not as active as other
groups.

I think in part it's because of what Nutch represents, and that's the
ability to create large scale search. Some developers would rather
use Nutch and associated tools and keep quiet about it because of
their goals, which in some cases might mean competing against the likes
of Google, Yahoo, Ask, MSN Live etc. For my part I'm not going to
compete with those companies on large scale search but I can see
competition in the vertical markets. And while Solr is hot these days
it's intended primarily for the enterprise market which is very
different than the large scale and vertical markets.

I completely agree. The group of people/companies that are creating large scale search solutions whether whole web or vertical is much smaller than say enterprise search or even the potential uses for Hadoop.


Now on to the future. I agree with many of the thoughts Otis put forward.

While Nutch has its problems, other than Heritrix there is no other
open source system available and Nutch's ability to perform web-wide
crawls must be preserved. However I'm thinking we should have a modular
approach to Nutch. For instance, why just one fetcher? Why not keep
the current one but also allow for the possibility of using Droids?
Parsing can and should include Tika. I'm not sure about outsourcing
indexing and searching to Solr but that could be a modular option as
well.

Yup. It should IMO also be easy to install and configure. I was having a discussion today where the main topic was, could we make Nutch have a nice graphical web interface for configuration, when you could drop it in, change some options, and create a customized vertical search over x domains?


I'm not sure if Nutch should become a top level project and move out
from under Lucene. Lucene has great visibility and for many reasons.
If Nutch was moved, would it still attract enough attention? It's been
noted that developer interest in Nutch is different than Lucene, Solr
etc. On the other hand it might do Nutch good to go TLP as maybe then
it would attract more developers especially if it was packaged
differently.

Part of this is about releases. Currently releases are voted on by Lucene PMC members and it takes 3 members to confirm a vote. There are only 2 Nutch committers on the Lucene PMC. So for releases, not that we have had many recently, other Lucene PMC members who may not be actively associated with Nutch would need to vote to release. If Nutch was a TLP there would be a Nutch PMC which would most likely include all current Nutch committers. The other option may be to add another Nutch committer to the Lucene PMC.


My thoughts. And hopefully in the near future my small team will be
able to contribute to Nutch in a meaningful way.


Any and every contribution is welcome.

Dennis


Marc Boucher
http://hyperix.com

On Mon, Mar 16, 2009 at 5:50 PM, Otis Gospodnetic
<ogjunk-nu...@yahoo.com> wrote:

Hello,


Comments inlined.

----- Original Message ----

From: Dennis Kubes <ku...@apache.org>
To: nutch-user@lucene.apache.org
Sent: Friday, March 13, 2009 8:19:37 PM

With the release of Nutch 1.0 I think it is a good time to begin a discussion
about the future of Nutch.  Here are some things to consider, and I would love to
hear everyone's views on this.

Nutch's original intention was as a large-scale www search engine.  That is a
very specific goal.  Only a few people and organizations actually use it on that
level.  (I just happen to be one of them as most of my work focuses on large
scale web search as opposed to vertical search).

Yes, there are fewer parties doing large scale web crawling.  Still, as there 
is no alternative fetcher+parser+indexer+searcher capable of handling large 
scale deployments like Nutch (or maybe Heritrix has the same scaling 
capabilities?), I think Nutch's ability to perform web-wide crawls, etc. should 
be preserved.

Many, perhaps most, people
using Nutch these days are either using parts of Nutch, such as the crawler, or
are targeting vertical or intranet type search engines.  This can be
seen in how many people have already started using the Solr integration
features.  So while Nutch was originally intended as a www search, IMO most
people aren't using it for that purpose.


That's my experience, too.  I think we can have both under the same Nutch roof.

Since there are different purposes for different users, would it be good to
consider moving Nutch to a top level Apache project out from under the Lucene
umbrella?  This would then allow the creation of nutch sub-projects, such as
nutch-solr, nutch-hbase.  Thoughts?


I disagree, at least in the near term.  There is nothing preventing those sub-projects 
existing under Nutch today.  Both Solr and Lucene have the contrib area where similar 
sub-projects live.  I think it's not a matter of being a TLP, but rather attracting 
enough developer interest, then user interest, and then contributor interest, so that 
these sub-projects can be created, maintained, advanced.  Right now, Solr gets a TON of 
attention, as does Lucene.  Nutch gets the least developer attention, and for some reason 
the nutch-user subscribers "feel" a bit different from solr-user or java-user 
subscribers.

Many parts of Nutch have also been implemented in other projects.  For example,
Tika for the parsers, Droids for the Crawler.  It begs the question of what
Nutch's core features are going forward.  When I think about search (again my
perspective is large scale), I think crawling or acquisition of data, parsing,
analysis, indexing, deployment, and searching.  I personally think that there is
much room for improvement in crawling and especially analysis.  Nutch shouldn't
just be about the shell but also the brains.


My feeling has long been that indexing and searching should be outsourced to 
Solr, parsing to Tika, and that the fetcher should probably be replaced with 
Droids.  I say probably because I'm not very familiar with Droids just yet.  
Nutch should, I think, then be an application built with all those components 
combined (is that what you mean by the shell?), and then apply its knowledge of 
either web-wide scale trickery, or vertical SE trickery, or ...  I think that's 
where the brains are needed, to tie it all together, while still making certain 
pieces swappable and more easily digestible by potential new contributors and 
developers, as well as users.  I know plugins do some of that already, but it 
seems like there might still be more in the core than there should/could be...

And one of the biggest things I see is many newcomers to Nutch have a very hard
time getting started.  Part of this is understanding the MapReduce mentality, part
is documentation, part is there is only so much time some of us have to answer
questions so some questions go unanswered on the lists.  How might this be
improved going forward?

I am not 100% sure, but I think it's a bit of all of the above.  Lucene has been around 
for 10 years and from day one had people answer questions from the most basic ones to the 
trickiest ones.  It's the same with Solr today.  Nutch has the least active and the 
smallest developer base, so questions don't get answered.  Again, people on this list 
also tend to have a different "style" of asking questions - no hellos, no thank 
yous, and so on, which doesn't help.


I think the existence of a book on Lucene helped Lucene, but Solr doesn't yet 
have a book, yet it still has a healthy developer and user community.  I think 
that's because Solr is simply more needed by more people than Nutch is.

Any other thoughts also welcome.  Really I want to start a discussion about 
where everyone thinks we are with the state of Nutch and its future.


I think it's good you started this discussion.  My opinion about what needs to 
be done with Nutch is above.  I think it needs to stay with Hadoop.  I think it 
should remain under Lucene for now.  Once and iff it develops those 
sub-projects and we all feel it's better for it to be TLP, then I think we can 
bring this up again.

Otis
