I agree that the goal is to make it easier to use Nutch in the future,
not harder. And I am not saying we should move away from Hadoop. I am
saying that people currently use Nutch for different purposes (vertical,
www, Solr, intranet), so what is the best way for Nutch to evolve in the
future to support that? We definitely want to keep an eye on how people
are using Nutch.
Dennis
John Martyniak wrote:
I think that this would be the case for making Nutch a top-level Apache
project, so that you can publish the framework and a complete app but
still tie it all together. Personally, I think that is the
strength of Nutch: you can use it right out of the box, without
programming, but all of the extensibility (customization) is there so that
you can extend it if you so desire.
-John
On Mar 14, 2009, at 9:44 AM, consultas wrote:
I have been using Nutch for more than four years now, as a vertical search
engine, having indexed, at times, over one million pages. On the
other hand, I don't know anything about programming or some specialized
applications. Words like Solr and others are like aliens to me. I am
just interested in a search engine that someone can really use, and
not an application that serves as a base for developing sophisticated
models.
So what I personally want for the future of Nutch is that it does
not turn into such a complicated application that only some very skilled
people can use it.
So I hope that Nutch always keeps an eye on the real users, who
want it for plain searching.
Thanks
----- Original Message ----- From: "Dennis Kubes" <ku...@apache.org>
To: <nutch-user@lucene.apache.org>
Sent: Friday, March 13, 2009 9:19 PM
Subject: The Future of Nutch
With the release of Nutch 1.0 I think it is a good time to begin a
discussion about the future of Nutch. Here are some things to consider,
and I would love to hear everyone's views on this.
Nutch's original intention was as a large-scale www search engine. That
is a very specific goal. Only a few people and organizations actually
use it on that level. (I just happen to be one of them as most of my
work focuses on large scale web search as opposed to vertical search).
Many, perhaps most, people using Nutch these days are either using parts
of Nutch, such as the crawler, or are targeting vertical or
intranet-type search engines. This can be seen in how many people have
already started using the Solr integration features. So while Nutch was
originally intended as a www search engine, IMO most people aren't using it for
that purpose.
Since there are different purposes for different users, would it be good
to consider moving Nutch to a top-level Apache project, out from under
the Lucene umbrella? This would then allow the creation of Nutch
sub-projects, such as nutch-solr and nutch-hbase. Thoughts?
Many parts of Nutch have also been implemented in other projects. For
example, Tika for the parsers, Droids for the crawler. It begs the
question of what Nutch's core features are going forward. When I think
about search (again, my perspective is large scale), I think of crawling or
acquisition of data, parsing, analysis, indexing, deployment, and
searching. I personally think that there is much room for improvement
in crawling and especially analysis. Nutch shouldn't just be about the
shell but also the brains.
And one of the biggest things I see is that many newcomers to Nutch have a
very hard time getting started. Part of this is understanding the MapReduce
mentality, part is documentation, and part is that there is only so much time
some of us have to answer questions, so some questions go unanswered on
the lists. How might this be improved going forward?
Any other thoughts also welcome. Really I want to start a discussion
about where everyone thinks we are with the state of Nutch and its
future.
Dennis