I absolutely agree. Duplicating work and focusing on non-core functionality when the same functionality is available through Tika is not wise for Nutch.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Andrzej Bialecki <a...@getopt.org>
> To: nutch-dev@lucene.apache.org
> Sent: Tuesday, March 10, 2009 5:57:36 AM
> Subject: Moving Nutch parsers to Tika
>
> Hi all,
>
> I've been debating this for a while, too, what Sami suggested in
> another thread:
> "I think we should start looking at Apache Tika for most (or all) of
> our parsers."
>
> This is actually a part of my broader vision for Nutch, that this
> project should not duplicate functionality of other well-established
> projects by re-implementing the same functionality, only poorly -
> because our focus is not on parsers, plugins, mime/charset detection,
> distributed RPC, but on building a robust platform for crawling.
>
> We could start working on this particular issue by donating the Nutch
> parsers to Tika, those that are not already present there, and start
> using Tika's parsers in Nutch where it's already possible. Once Tika
> supports all types of parsers that we have, we should switch
> completely to Tika.
>
> Of course, this will happen post-1.0 release.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com