Andrzej Bialecki wrote:
Hi all,
I've also been debating for a while what Sami suggested in another
thread: "I think we should start looking at Apache Tika for most (or
all) of our parsers."
This is actually part of my broader vision for Nutch: this project
should not duplicate the functionality of other well-established
projects by re-implementing it, only poorly, because our focus is not
on parsers, plugins, mime/charset detection, or distributed RPC, but
on building a robust platform for crawling.
I share that same vision.
We could start working on this particular issue by donating the Nutch
parsers to Tika, those that are not already present there, and start
using Tika's parsers in Nutch where it's already possible. Once Tika
supports all types of parsers that we have, we should switch completely
to Tika.
I think the only parser that is totally missing from Tika is SWF
(https://issues.apache.org/jira/browse/TIKA-147). Tika also supports
some formats that Nutch currently does not (in addition to providing
more advanced parsing for some formats).
--
Sami Siren