I just wish there were some clear, publicly available documentation for the Nutch/Solr integration. Or are some developers already working on this?

- Tony

On Mon, Mar 16, 2009 at 6:50 PM, Otis Gospodnetic <ogjunk-nu...@yahoo.com> wrote:

> Hello,
>
> Comments inlined.
>
> ----- Original Message ----
> > From: Dennis Kubes <ku...@apache.org>
> > To: nutch-user@lucene.apache.org
> > Sent: Friday, March 13, 2009 8:19:37 PM
> >
> > With the release of Nutch 1.0 I think it is a good time to begin a
> > discussion about the future of Nutch. Here are some things to
> > consider, and I would love to hear everyone's views on this.
> >
> > Nutch's original intention was as a large-scale www search engine.
> > That is a very specific goal. Only a few people and organizations
> > actually use it on that level. (I just happen to be one of them, as
> > most of my work focuses on large-scale web search as opposed to
> > vertical search.)
>
> Yes, there are fewer parties doing large-scale web crawling. Still, as
> there is no alternative fetcher+parser+indexer+searcher capable of
> handling large-scale deployments like Nutch (or maybe Heritrix has the
> same scaling capabilities?), I think Nutch's ability to perform
> web-wide crawls, etc. should be preserved.
>
> > Many, perhaps most, people using Nutch these days are either using
> > parts of Nutch, such as the crawler, or are targeting vertical or
> > intranet type search engines. This can be seen in how many people
> > have already started using the Solr integration features. So while
> > Nutch was originally intended as a www search, IMO most people
> > aren't using it for that purpose.
>
> That's my experience, too. I think we can have both under the same
> Nutch roof.
>
> > Since there are different purposes for different users, would it be
> > good to consider moving Nutch to a top-level Apache project, out
> > from under the Lucene umbrella? This would then allow the creation
> > of Nutch sub-projects, such as nutch-solr, nutch-hbase. Thoughts?
>
> I disagree, at least in the near term. There is nothing preventing
> those sub-projects existing under Nutch today. Both Solr and Lucene
> have the contrib area where similar sub-projects live. I think it's
> not a matter of being a TLP, but rather attracting enough developer
> interest, then user interest, and then contributor interest, so that
> these sub-projects can be created, maintained, advanced. Right now,
> Solr gets a TON of attention, as does Lucene. Nutch gets the least
> developer attention, and for some reason the nutch-user subscribers
> "feel" a bit different from solr-user or java-user subscribers.
>
> > Many parts of Nutch have also been implemented in other projects.
> > For example, Tika for the parsers, Droids for the crawler. It begs
> > the question of what Nutch's core features are going forward. When
> > I think about search (again, my perspective is large scale), I think
> > crawling or acquisition of data, parsing, analysis, indexing,
> > deployment, and searching. I personally think that there is much
> > room for improvement in crawling and especially analysis. Nutch
> > shouldn't just be about the shell but also the brains.
>
> My feeling has long been that indexing and searching should be
> outsourced to Solr, parsing to Tika, and that the fetcher should
> probably be replaced with Droids. I say probably because I'm not very
> familiar with Droids just yet. Nutch should, I think, then be an
> application built with all those components combined (is that what you
> mean by the shell?), and then apply its knowledge of either web-wide
> scale trickery, or vertical SE trickery, or ... I think that's where
> the brains are needed, to tie it all together, while still making
> certain pieces swappable and more easily digestible by potential new
> contributors and developers, as well as users. I know plugins do some
> of that already, but it seems like there might still be more in the
> fore than there should/could be...
>
> > And one of the biggest things I see is that many newcomers to Nutch
> > have a very hard time getting started. Part of this is understanding
> > the MapReduce mentality, part is documentation, part is that there
> > is only so much time some of us have to answer questions, so some
> > questions go unanswered on the lists. How might this be improved
> > going forward?
>
> I am not 100% sure, but I think it's a bit of all of the above. Lucene
> has been around for 10 years and from day one had people answer
> questions from the most basic ones to the trickiest ones. It's the
> same with Solr today. Nutch has the least active and the smallest
> developer base, so questions don't get answered. Again, people on this
> list also tend to have a different "style" of asking questions - no
> hellos, no thank yous, and so on, which doesn't help.
>
> I think the existence of a book on Lucene helped Lucene, but Solr
> doesn't yet have a book, yet it still has a healthy developer and user
> community. I think that's because Solr is simply more needed by more
> people than Nutch is.
>
> > Any other thoughts also welcome. Really I want to start a discussion
> > about where everyone thinks we are with the state of Nutch and its
> > future.
>
> I think it's good you started this discussion. My opinion about what
> needs to be done with Nutch is above. I think it needs to stay with
> Hadoop. I think it should remain under Lucene for now. Once and iff it
> develops those sub-projects and we all feel it's better for it to be a
> TLP, then I think we can bring this up again.
>
> Otis

--
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信