Andrzej Bialecki wrote:
Hi all,

I've been thinking about this for a while, too. As Sami suggested in another thread: "I think we should start looking at Apache Tika for most (or all) of our parsers."

This is actually part of my broader vision for Nutch: this project should not duplicate the functionality of other well-established projects by re-implementing it, only poorly. Our focus is not on parsers, plugins, mime/charset detection, or distributed RPC, but on building a robust platform for crawling.

I share that same vision.


We could start working on this particular issue by donating to Tika those Nutch parsers that are not already present there, and by using Tika's parsers in Nutch wherever that is already possible. Once Tika supports all the formats that our parsers handle, we should switch completely to Tika.

I think that the only parser that is totally missing from Tika is swf (https://issues.apache.org/jira/browse/TIKA-147). Tika also supports some formats that Nutch currently does not (in addition to providing more advanced parsing on some formats).

--
 Sami Siren
