I absolutely agree.  Duplicating work and focusing on non-core functionality,
when the same functionality is available through Tika, is not wise for Nutch.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Andrzej Bialecki <a...@getopt.org>
> To: nutch-dev@lucene.apache.org
> Sent: Tuesday, March 10, 2009 5:57:36 AM
> Subject: Moving Nutch parsers to Tika
> 
> Hi all,
> 
> I've been debating this for a while, too, what Sami suggested in another 
> thread: 
> "I think we should start looking at Apache Tika for most (or all) of our 
> parsers."
> 
> This is actually a part of my broader vision for Nutch, that this project
> should not duplicate functionality of other well-established projects by
> re-implementing the same functionality, only poorly - because our focus is
> not on parsers, plugins, mime/charset detection, distributed RPC, but on
> building a robust platform for crawling.
> 
> We could start working on this particular issue by donating the Nutch parsers
> to Tika, those that are not already present there, and start using Tika's
> parsers in Nutch where it's already possible. Once Tika supports all types of
> parsers that we have, we should switch completely to Tika.
> 
> Of course, this will happen post-1.0 release.
> 
> -- Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
