Re: The Future of Nutch

Bradford Stephens Fri, 27 Mar 2009 16:57:07 -0700

Hey there,

Just chiming in that we use the complete Nutch + Hadoop + Lucene stack
-- we download pages, index them for keywords, and then do heavy
Semantic Parsing on them to produce BI data. We also use a lot of
plug-ins for parsing and ranking information.


What we don't use is the 'built-in GUI search' ability... but besides
that, the core of our business is evolving around Nutch :)

2009/3/20 Doğacan Güney <doga...@gmail.com>:
> Hi,
>
> On Sat, Mar 14, 2009 at 02:19, Dennis Kubes <ku...@apache.org> wrote:
>> With the release of Nutch 1.0 I think it is a good time to begin a
>> discussion about the future of Nutch.  Here are some things to consider, and
>> I would love to hear everyone's views on them.
>>
>> Nutch's original intention was as a large-scale www search engine.  That is
>> a very specific goal.  Only a few people and organizations actually use it
>> on that level.  (I just happen to be one of them as most of my work focuses
>> on large scale web search as opposed to vertical search). Many, perhaps
>> most, people using Nutch these days are either using parts of Nutch, such as
>> the crawler, or are targeting vertical or intranet-type search
>> engines.  This can be seen in how many people have already started using the
>> Solr integration features.  So while Nutch was originally intended as a www
>> search, IMO most people aren't using it for that purpose.
>>
>> Since there are different purposes for different users, would it be good to
>> consider moving Nutch to a top-level Apache project, out from under the
>> Lucene umbrella?  This would then allow the creation of Nutch sub-projects,
>> such as nutch-solr, nutch-hbase.  Thoughts?
>>
>> Many parts of Nutch have also been implemented in other projects.  For
>> example, Tika for the parsers, Droids for the crawler.  It begs the question
>> of what Nutch's core features are going forward.  When I think about search
>> (again my perspective is large scale), I think crawling or acquisition of
>> data, parsing, analysis, indexing, deployment, and searching.  I personally
>> think that there is much room for improvement in crawling and especially
>> analysis.  Nutch shouldn't just be about the shell but also the brains.
>>
>
> I think nutch-solr and nutch-hbase should be in one unified project :)
>
> I can understand the difficulty (for newcomers) if we start depending
> on too many external projects. It would certainly be confusing
> to have to start a solr server then hbase master/slaves just to be
> able to crawl one intranet website locally. On the other hand,
> if we split nutch into nutch-hbase, nutch-hadoop and nutch-otherthings,
> I am worried we will have to create a waaay too generic interface
> to deal with them and not reap the advantages of using solr over
> lucene and hbase over hadoop. Also, more backends possibly
> mean more bugs and more integration problems.
>
> So I think delegating nutch functionality to other projects
> (tika/droids/solr/etc) is a great idea (so nutch can focus on "the
> brains" as Dennis said), but I don't like the idea of separating nutch
> into pieces.
>
> So I guess for a small vertical search engine, it may seem unnecessary
> to also deal with solr/etc, but as long as we have good documentation*,
> they are not that difficult to handle. And they don't have a large performance
> or memory overhead.
>
> About vertical/large-scale search engine split: I guess a good example here
> is Dennis' FieldIndexer work. It is much more flexible for people who want
> to extend nutch's indexing architecture, but maybe overkill for people (and
> I am not convinced that it is) wanting to run vintage nutch on a small scale.
> I, again, don't like splitting nutch into two (or three, four...) parts
> like this. But I think having different crawl paths for different users
> is much more manageable than having different architectures. So we always
> use solr/hbase/etc. as our architecture. But you can run a one-job indexer
> if you want, or run FieldIndexer. You can use the on-the-fly scoring scheme,
> or you can use page rank/other complex offline scoring schemes.
>
>> And one of the biggest things I see is many newcomers to nutch have a very
>> hard time getting started.  Part of this is understanding the mapreduce
>> mentality, part is documentation, and part is that some of us have only so
>> much time to answer questions, so some questions go unanswered on the lists.
>> How might this be improved going forward?
>>
>
> Docs, docs, docs :D
>
>> Any other thoughts also welcome.  Really I want to start a discussion about
>> where everyone thinks we are with the state of Nutch and its future.
>>
>
> Thanks for starting the discussion Dennis.
>
>> Dennis
>>
>>
>
> * And we don't have good documentation right now (and I am much
> to blame for it:). I think this should be an explicit goal for us in the
> future. I am thinking something like "no major features without documentation
> in the wiki".
>
>
>
> --
> Doğacan Güney
>
