Hi All

Based on the plan on the wiki <https://wiki.apache.org/tika/Tika2_0RoadMap> <https://wiki.apache.org/tika/Tika2_0MigrationGuide>, we still have a major breaking change or two planned for Tika 2 that we haven't yet "broken". (There's also the removal of some deprecated stuff, etc.)


As I understand it, the biggest breaking TODO change is around having multiple parsers available + active for a given format. This could be to support fallback parsers, eg "try this fancy new parser, but if it fails retry with this simpler one" or "try this xml parser, if that fails just try strings". A related but different case is to cleanly support multiple parsers covering different aspects, eg OCR an image plus extract metadata, or NER on the contents of a scientific PDF + text + metadata + NER of the OCR of embedded images in the PDF.

Currently, we can't cleanly do the former, and the latter is (badly) handled via one parser (eg OCR or NER) having an embedded hard-coded reference to another (eg Image or PDF).


We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion

The biggest stumbling block, as I see it, is how to let multiple parsers interact with the SAX content handler. For the fallback case, that's how to say "sorry, ignore all that XML we already sent, we're starting again with this XML now". For the multiple parser case, it's how we could have the image parser "finish" the (empty) XHTML but then have the OCR one send some text, or have the NER parser get at the XHTML text of the PDF + OCR of embedded images to enhance with the entities.
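
To make the fallback half of that concrete, here is a rough sketch of one possible approach: buffer each parser's SAX events and only replay them to the real handler once that parser has finished cleanly. This is not an existing Tika API and not a proposal from the wiki page. SaxSource, RecordingHandler and parseWithFallback are hypothetical names made up for illustration, and SaxSource is just a stand-in for "something that emits SAX events". Only the plain org.xml.sax classes from the JDK are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class BufferedFallbackSketch {

    /** Hypothetical stand-in for a parser: all it does is emit SAX events. */
    interface SaxSource {
        void emit(ContentHandler handler) throws SAXException;
    }

    /** One recorded SAX event that can be replayed later. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    /** Records events instead of forwarding them, so a failed attempt can be thrown away. */
    static class RecordingHandler extends DefaultHandler {
        private final List<Event> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            // Copy the attributes: the caller is allowed to reuse the object.
            Attributes copy = new AttributesImpl(atts);
            events.add(t -> t.startElement(uri, local, qName, copy));
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            events.add(t -> t.endElement(uri, local, qName));
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Copy the characters: the caller's buffer may be overwritten.
            char[] copy = Arrays.copyOfRange(ch, start, start + length);
            events.add(t -> t.characters(copy, 0, copy.length));
        }

        void replay(ContentHandler target) throws SAXException {
            for (Event e : events) {
                e.replay(target);
            }
        }
    }

    /** Tries each source in order; only the first one that completes reaches the real handler. */
    static void parseWithFallback(List<SaxSource> sources, ContentHandler real)
            throws SAXException {
        SAXException last = null;
        for (SaxSource source : sources) {
            RecordingHandler buffer = new RecordingHandler();
            try {
                source.emit(buffer);
            } catch (SAXException e) {
                last = e;   // discard the partial output and try the next source
                continue;
            }
            buffer.replay(real);
            return;
        }
        throw last != null ? last : new SAXException("no sources configured");
    }
}
```

The obvious cost is that nothing streams any more: the whole output of each attempt sits in memory until that attempt finishes, which may well be unacceptable for large documents. It also says nothing about metadata written by a failed attempt, and it doesn't help with the multiple parser case at all.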


What do we think about this? Can we come up with a solution to let this go forward? Is there a pattern from elsewhere we can follow?

Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and do this stuff in Tika 3 instead?

Nick
