Not-yet-broken breaking changes for Tika 2?

Nick Burch Thu, 26 Oct 2017 03:57:17 -0700

Hi All

Based on the plan on the wiki <https://wiki.apache.org/tika/Tika2_0RoadMap> <https://wiki.apache.org/tika/Tika2_0MigrationGuide>, we still have a major breaking change or two planned for Tika 2 that we haven't yet "broken". (There's also removing some deprecated stuff etc)

As I understand it, the biggest breaking TODO change is around having multiple parsers available + active for a given format. This could be to support fallback parsers, eg "try this fancy new parser, but if it fails retry with this simpler one" or "try this xml parser, if that fails just try strings". A related but different case is to cleanly support multiple parsers covering different aspects, eg OCR an image plus extract metadata, or NER on the contents of a scientific PDF + text + metadata + NER of the OCR of embedded images in the PDF.

Currently, we can't cleanly do the former, and the latter is (badly) handled via one parser (eg OCR or NER) having an embedded hard-coded reference to another (eg Image or PDF).
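
To make the fallback case concrete, here is a rough sketch of "try the fancy parser, then the simpler one". It deliberately does not use the real Tika Parser API: SimpleParser, parseWithFallback and demo are all hypothetical names made up for illustration, and it returns a plain String rather than dealing with SAX events at all.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: none of these types are part of Tika.
final class FallbackSketch {

    /** Hypothetical stand-in for a parser: turns bytes into text, or throws. */
    interface SimpleParser {
        String parse(byte[] input) throws Exception;
    }

    /** Tries each parser in order and returns the first successful result. */
    static String parseWithFallback(List<SimpleParser> parsers, byte[] input) throws Exception {
        Exception lastFailure = null;
        for (SimpleParser parser : parsers) {
            try {
                return parser.parse(input);
            } catch (Exception e) {
                lastFailure = e; // remember the failure, then try the next parser
            }
        }
        if (lastFailure == null) {
            throw new IllegalArgumentException("no parsers configured");
        }
        throw lastFailure; // every parser failed: surface the last error
    }

    /** Small demonstration: the "fancy" parser fails, the "strings" one succeeds. */
    static String demo() {
        SimpleParser fancy = input -> {
            throw new IllegalStateException("fancy parser failed");
        };
        SimpleParser strings = input -> new String(input, StandardCharsets.UTF_8);
        try {
            return parseWithFallback(
                    Arrays.asList(fancy, strings),
                    "hello".getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This only works so neatly because each parser hands back a complete result. Once the parsers stream output as they go, anything the failed parser already emitted has to be dealt with somehow.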



We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion

The biggest stumbling block, as I see it, is how to let multiple parsers interact with the SAX content handler. For the fallback case, that's how to say "sorry, ignore all that XML we already sent, we're starting again with this XML now". For the multiple parser case, it's how we could have the image parser "finish" the (empty) XHTML but then have the OCR one send some text, or have the NER parser get at the XHTML text of the PDF + OCR of embedded images to enhance with the entities.
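
One possible pattern for the "ignore all that XML we already sent" problem, purely as a sketch and not something from the wiki page, is to buffer SAX events and only replay them to the real handler once a parser has succeeded. The class below uses only the JDK's org.xml.sax types; BufferingHandlerSketch, discard, commit and demo are hypothetical names, and only startElement, endElement and characters are recorded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: records SAX events so a failed attempt can be dropped.
final class BufferingHandlerSketch extends DefaultHandler {

    /** One recorded SAX event that can be replayed against another handler. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final ContentHandler downstream;
    private final List<Event> buffer = new ArrayList<>();

    BufferingHandlerSketch(ContentHandler downstream) {
        this.downstream = downstream;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Attributes copy = new AttributesImpl(atts); // callers may reuse their Attributes object
        buffer.add(h -> h.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        buffer.add(h -> h.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        char[] copy = Arrays.copyOfRange(ch, start, start + length); // the array may be reused too
        buffer.add(h -> h.characters(copy, 0, copy.length));
    }

    /** Drops everything recorded so far, eg after a parser failed part-way. */
    void discard() {
        buffer.clear();
    }

    /** Replays the recorded events to the downstream handler, then clears them. */
    void commit() throws SAXException {
        for (Event event : buffer) {
            event.replay(downstream);
        }
        buffer.clear();
    }

    private static void emit(ContentHandler h, String text) throws SAXException {
        h.startElement("", "p", "p", new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", "p", "p");
    }

    /** Demonstration: the first attempt is discarded, the second is committed. */
    static String demo() {
        StringBuilder seen = new StringBuilder();
        ContentHandler realHandler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        };
        BufferingHandlerSketch handler = new BufferingHandlerSketch(realHandler);
        try {
            emit(handler, "partial output from the failed parser");
            handler.discard(); // pretend the first parser threw here
            emit(handler, "output from the fallback parser");
            handler.commit();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }
}
```

The obvious cost is that nothing reaches the real handler until commit, so the whole output of each attempt sits in memory and we lose streaming. It also does nothing for the multiple parser case, where a second parser wants to add to or read what the first one produced.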

What do we think for this? Can we come up with a solution to let this go forward? Is there a pattern from elsewhere we can follow?

Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and do this stuff in Tika 3 instead?


Nick
