Re: The Future of Nutch

buddha1021 Fri, 13 Mar 2009 19:43:00 -0700

Hi Dennis,

"Nutch's original intention was as a large-scale www search engine. "
I strongly agree with you, Dennis! Nutch's goal is specifically to process
large-scale data the way Google does. There is no doubt that Nutch should be
a www search engine, and absolutely not a vertical search engine!


I am confident that Hadoop can process the large amounts of data a www search
engine needs. But Lucene? I am afraid the size of a Lucene index per server
is quite limited: 10G? or 30G? That is not enough for a www search engine!
IMO, this is a bottleneck!

How many pages does visvo search currently? 100 million? or 1000 million?

IMO, it would be very good to move Nutch to a top level apache project
out from under the Lucene umbrella!

But all the sub-projects of Nutch should be active enough; if not, Nutch's
development will be slow, and that is no good for Nutch's unity.

So the number of sub-projects should be small, and the sub-projects
should be active, efficient and also strong enough!

Good luck!



Dennis Kubes-2 wrote:
> 
> With the release of Nutch 1.0 I think it is a good time to begin a 
> discussion about the future of Nutch.  Here are some things to consider 
> and I would love to hear everyone's views on this.
> 
> Nutch's original intention was as a large-scale www search engine.  That 
> is a very specific goal.  Only a few people and organizations actually 
> use it on that level.  (I just happen to be one of them as most of my 
> work focuses on large scale web search as opposed to vertical search). 
> Many, perhaps most, people using Nutch these days are either using parts 
> of Nutch, such as the crawler, or are targeting towards vertical or 
> intranet type search engines.  This can be seen in how many people have 
> already started using the Solr integration features.  So while Nutch was 
> originally intended as a www search, IMO most people aren't using it for 
> that purpose.
> 
> Since there are different purposes for different users, would it be good 
> to consider moving Nutch to a top level apache project out from under 
> the Lucene umbrella?  This would then allow the creation of nutch 
> sub-projects, such as nutch-solr, nutch-hbase.  Thoughts?
> 
> Many parts of Nutch have also been implemented in other projects.  For 
> example, Tika for the parsers, Droids for the Crawler.  It begs the 
> question of what Nutch's core features are going forward.  When I think 
> about search (again my perspective is large scale), I think crawling or 
> acquisition of data, parsing, analysis, indexing, deployment, and 
> searching.  I personally think that there is much room for improvement 
> in crawling and especially analysis.  Nutch shouldn't just be about the 
> shell but also the brains.
> 
> And one of the biggest things I see is many newcomers to nutch have a 
> very hard time getting started.  Part of this is understanding mapreduce 
> mentality, part is documentation, part is there is only so much time 
> some of us have to answer questions so some questions go unanswered on 
> the lists.  How might this be improved going forward?
> 
> Any other thoughts also welcome.  Really I want to start a discussion 
> about where everyone thinks we are with the state of Nutch and its future.
> 
> Dennis
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/The-Future-of-Nutch-tp22507507p22508747.html
Sent from the Nutch - User mailing list archive at Nabble.com.
