Re: The Future of Nutch

buddha1021 Sat, 14 Mar 2009 05:45:37 -0700

Hi,
How people use Nutch and what Nutch is able to do are very different
things, and the two are not contradictory.
When I say "There is no doubt that nutch will be a www search engine
absolutely, but absolutely not a vertical search!", I want to emphasize the
goal of Nutch.


It is like Linux: it can be used as a server because it is so stable and
efficient. But people can also use Linux as a desktop, because that is very
easy for a kernel that is able to run a server OS.

Nutch's goal is a large-scale WWW search engine. If this goal is achieved
successfully, it would be very easy to create a vertical search engine with
Nutch.

So I want to emphasize the goal of Nutch, not the way people use
Nutch.

Integrating all the information on the internet is a very great project.
Hadoop has solved the key problem of processing massive data (TB), so Nutch
has hope of becoming a second Google.

If this goal is achieved successfully, in a sense, Nutch will be the next
Linux.


yanky young wrote:
> 
> Hi:
> 
> I also agree that most usage scenarios of Nutch are in the vertical search
> area. In some unusual cases users may not even use Nutch indexing at
> all; they just crawl some pages for mirroring purposes. And in some cases
> of vertical search, the user only needs a fraction of pages, e.g. house
> rental info or restaurant info. So how about distributing Nutch as
> components so that the crawler can be used independently, without
> indexing? That's actually what Droids is trying to address, but Nutch can
> also do it in a more scalable way.
> 
> Another point is that if Nutch needs to be more easily customized for
> special cases such as vertical search, a new ranking mechanism must be
> introduced. tf/idf just cannot work. Maybe a machine learning scheme such
> as a text classifier can be employed.
> 
> It is great for Nutch to be a top-level Apache project, because
> subprojects for special cases can be created for easier customization.
> 
> I have also seen posts about using Spring as a component assembly
> framework for Nutch. Maybe it can be created as a subproject for Spring
> users.
> 
> just my 2 cents
> 
> good luck
> 
> yanky
> 
> 2009/3/14 buddha1021 <buddha1...@yahoo.cn>
> 
>>
>> Hi Dennis,
>>
>> "Nutch's original intention was as a large-scale www search engine. "
>> I very much agree with you, Dennis! Nutch's goal is specifically to
>> process large-scale data the way Google does. There is no doubt
>> that nutch will be a www search engine absolutely, but absolutely not a
>> vertical search!
>>
>> I am confident that Hadoop can process the large data of a WWW search
>> engine. But Lucene? I am afraid the limit on the size of Lucene's index
>> per server is very small: 10G? Or 30G? This is not enough for a WWW
>> search engine. IMO, this is a bottleneck.
>>
>> How many pages does Visvo search currently? 100 million? Or 1000 million?
>>
>> IMO, it would be very good to move Nutch to a top-level Apache project,
>> out from under the Lucene umbrella.
>>
>> But all the sub-projects of Nutch should be active enough; if not,
>> Nutch's development will be slow, which is no good for Nutch's unity.
>>
>> So the number of sub-projects should be small, and the sub-projects
>> should be active, efficient and also strong enough.
>>
>> Good luck !
>>
>>
>>
>> Dennis Kubes-2 wrote:
>> >
>> > With the release of Nutch 1.0 I think it is a good time to begin a
>> > discussion about the future of Nutch.  Here are some things to consider
>> > and I would love to hear everyone's views on this.
>> >
>> > Nutch's original intention was as a large-scale www search engine.
>> > That is a very specific goal.  Only a few people and organizations
>> > actually use it on that level.  (I just happen to be one of them as
>> > most of my work focuses on large scale web search as opposed to
>> > vertical search).  Many, perhaps most, people using Nutch these days
>> > are either using parts of Nutch, such as the crawler, or are targeting
>> > towards vertical or intranet type search engines.  This can be seen in
>> > how many people have already started using the Solr integration
>> > features.  So while Nutch was originally intended as a www search, IMO
>> > most people aren't using it for that purpose.
>> >
>> > Since there are different purposes for different users, would it be
>> > good to consider moving Nutch to a top level apache project out from
>> > under the Lucene umbrella?  This would then allow the creation of nutch
>> > sub-projects, such as nutch-solr, nutch-hbase.  Thoughts?
>> >
>> > Many parts of Nutch have also been implemented in other projects.  For
>> > example, Tika for the parsers, Droids for the Crawler.  It begs the
>> > question of what Nutch's core features are going forward.  When I think
>> > about search (again my perspective is large scale), I think crawling or
>> > acquisition of data, parsing, analysis, indexing, deployment, and
>> > searching.  I personally think that there is much room for improvement
>> > in crawling and especially analysis.  Nutch shouldn't just be about the
>> > shell but also the brains.
>> >
>> > And one of the biggest things I see is many newcomers to nutch have a
>> > very hard time getting started.  Part of this is understanding the
>> > mapreduce mentality, part is documentation, part is there is only so
>> > much time some of us have to answer questions so some questions go
>> > unanswered on the lists.  How might this be improved going forward?
>> >
>> > Any other thoughts also welcome.  Really I want to start a discussion
>> > about where everyone thinks we are with the state of Nutch and its
>> > future.
>> >
>> > Dennis
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/The-Future-of-Nutch-tp22507507p22508747.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/The-Future-of-Nutch-tp22507507p22512299.html
Sent from the Nutch - User mailing list archive at Nabble.com.
