Hey there,

Just chiming in that we use the complete Nutch + Hadoop + Lucene stack -- we download pages, index them for keywords, and then do heavy semantic parsing on them to produce BI data. We also use a lot of plug-ins for parsing and ranking information.
What we don't use is the 'built-in GUI search' ability... but besides that, the core of our business is evolving around Nutch :)

2009/3/20 Doğacan Güney <doga...@gmail.com>:
> Hi,
>
> On Sat, Mar 14, 2009 at 02:19, Dennis Kubes <ku...@apache.org> wrote:
>> With the release of Nutch 1.0 I think it is a good time to begin a
>> discussion about the future of Nutch. Here are some things to consider, and
>> I would love to hear everyone's views on this.
>>
>> Nutch's original intention was as a large-scale www search engine. That is
>> a very specific goal. Only a few people and organizations actually use it
>> on that level. (I just happen to be one of them as most of my work focuses
>> on large scale web search as opposed to vertical search). Many, perhaps
>> most, people using Nutch these days are either using parts of Nutch, such as
>> the crawler, or are targeting vertical or intranet type search
>> engines. This can be seen in how many people have already started using the
>> Solr integration features. So while Nutch was originally intended as a www
>> search, IMO most people aren't using it for that purpose.
>>
>> Since there are different purposes for different users, would it be good to
>> consider moving Nutch to a top level apache project out from under the
>> Lucene umbrella? This would then allow the creation of nutch sub-projects,
>> such as nutch-solr, nutch-hbase. Thoughts?
>>
>> Many parts of Nutch have also been implemented in other projects. For
>> example, Tika for the parsers, Droids for the Crawler. It begs the question
>> of what Nutch's core features are going forward. When I think about search
>> (again my perspective is large scale), I think crawling or acquisition of
>> data, parsing, analysis, indexing, deployment, and searching. I personally
>> think that there is much room for improvement in crawling and especially
>> analysis. Nutch shouldn't just be about the shell but also the brains.
>>
>
> I think nutch-solr and nutch-hbase should be in one unified project :)
>
> I can understand the difficulty (for newcomers) if we start depending
> on too many external projects. It would certainly be confusing
> to have to start a solr server then hbase master/slaves just to be
> able to crawl one intranet website locally. On the other hand,
> if we split nutch into nutch-hbase, nutch-hadoop and nutch-otherthings,
> I am worried we will have to create a waaay too generic interface
> to deal with them and not reap the advantages of using solr over
> lucene and hbase over hadoop. Also, more backends possibly
> mean more bugs and more integration problems.
>
> So I think delegating nutch functionality to other projects
> (tika/droids/solr/etc) is a great idea (so nutch can focus on
> "the brains" as Dennis said), but I don't like the idea of
> separating nutch into pieces.
>
> So I guess for a small vertical search engine, it may seem unnecessary
> to also deal with solr/etc, but as long as we have good documentation*,
> they are not that difficult to handle. And they don't have a large
> performance or memory overhead.
>
> About vertical/large-scale search engine split: I guess a good example here
> is Dennis' FieldIndexer work. It is much more flexible for people who want
> to extend nutch's indexing architecture, but maybe overkill for people (and
> I am not convinced that it is) wanting to run vintage nutch on a small scale.
> I, again, don't like splitting nutch into two (or three, four...) parts
> like this. But I think having different crawl paths for different users is
> much more manageable than having different architectures. So we always use
> solr/hbase/etc. as our architecture. But you can run a one-job indexer if
> you want or run FieldIndexer. You can use the on-the-fly scoring scheme or
> you can use page rank/other complex offline scoring schemes.
>
>> And one of the biggest things I see is many newcomers to nutch have a very
>> hard time getting started. Part of this is understanding mapreduce
>> mentality, part is documentation, part is there is only so much time some of
>> us have to answer questions so some questions go unanswered on the lists.
>> How might this be improved going forward?
>>
>
> Docs, docs, docs :D
>
>> Any other thoughts also welcome. Really I want to start a discussion about
>> where everyone thinks we are with the state of Nutch and its future.
>>
>
> Thanks for starting the discussion Dennis.
>
>> Dennis
>>
>
> * And we don't have good documentation right now (and I am much
> to blame for it:). I think this should be an explicit goal for us in the
> future. I am thinking something like "no major features without documentation
> in the wiki".
>
> --
> Doğacan Güney