Hi All
Based on the plan on the wiki
<https://wiki.apache.org/tika/Tika2_0RoadMap>
<https://wiki.apache.org/tika/Tika2_0MigrationGuide>, we still have a
major breaking change or two planned for Tika 2 that we haven't yet
"broken". (There's also some deprecated stuff still to remove, etc.)
As I understand it, the biggest outstanding breaking change is around having
multiple parsers available + active for a given format. This could be to
support fallback parsers, eg "try this fancy new parser, but if it fails
retry with this simpler one" or "try this xml parser, if that fails just
try strings". A related but different case is to cleanly support multiple
parsers covering different aspects, eg OCR an image plus extract metadata,
or NER on the contents of a scientific PDF + text + metadata + NER of the
OCR of embedded images in the PDF.
Currently, we can't cleanly do the former, and the latter is (badly)
handled via one parser (eg OCR or NER) having a hard-coded reference
to another (eg Image or PDF).
We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion
The biggest stumbling block, as I see it, is how to let multiple parsers
interact with the SAX content handler. For the fallback case, that's how
to say "sorry, ignore all that XML we already sent, we're starting again
with this XML now". For the multiple parser case, it's how we could have
the image parser "finish" the (empty) XHTML but then have the OCR one send
some text, or have the NER parser get at the XHTML text of the PDF + OCR
of embedded images to enhance with the entities.
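To make the fallback case concrete, here is one possible shape for it: a rough
sketch only, not a proposal for the final API. Each parser writes into a
buffering handler, and the recorded SAX events are only replayed to the real
content handler if that parser finishes without throwing. It uses just the JDK
SAX classes; FallbackSketch, SaxSource, BufferingHandler and parseWithFallback
are all hypothetical names made up for illustration, with SaxSource standing in
for a real Tika Parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class FallbackSketch {

    /** Hypothetical stand-in for a parser: anything that emits SAX events. */
    interface SaxSource {
        void emit(ContentHandler handler) throws Exception;
    }

    /** One recorded SAX event that can be replayed later. */
    interface SaxEvent {
        void replay(ContentHandler target) throws SAXException;
    }

    /** Records events instead of forwarding them, so a failed attempt can be dropped. */
    static class BufferingHandler extends DefaultHandler {
        private final List<SaxEvent> events = new ArrayList<>();

        @Override public void startDocument() {
            events.add(t -> t.startDocument());
        }
        @Override public void endDocument() {
            events.add(t -> t.endDocument());
        }
        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            // Parsers may reuse the Attributes object, so take a copy
            Attributes copy = new AttributesImpl(atts);
            events.add(t -> t.startElement(uri, local, qName, copy));
        }
        @Override public void endElement(String uri, String local, String qName) {
            events.add(t -> t.endElement(uri, local, qName));
        }
        @Override public void characters(char[] ch, int start, int length) {
            // The char buffer is usually reused too, so copy the relevant slice
            char[] copy = Arrays.copyOfRange(ch, start, start + length);
            events.add(t -> t.characters(copy, 0, copy.length));
        }

        void replayTo(ContentHandler target) throws SAXException {
            for (SaxEvent e : events) {
                e.replay(target);
            }
        }
    }

    /** Tries each source in order; only the first one that succeeds reaches the real handler. */
    static void parseWithFallback(List<SaxSource> sources, ContentHandler real) throws Exception {
        Exception last = null;
        for (SaxSource source : sources) {
            BufferingHandler buffer = new BufferingHandler();
            try {
                source.emit(buffer);
            } catch (Exception e) {
                last = e;      // discard the partial output and try the next one
                continue;
            }
            buffer.replayTo(real);
            return;
        }
        throw last != null ? last : new IllegalStateException("no parsers configured");
    }
}
```

The obvious cost is that the whole document's events sit in memory until the
parser finishes, so nothing streams, and metadata set by a failed attempt would
need the same "undo" treatment. It also doesn't help with the multiple parser
case, where the second parser needs to add to or read from what the first one
already sent.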
What do we think about this? Can we come up with a solution to let this go
forward? Is there a pattern from elsewhere we can follow?
Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and
do this stuff in Tika 3 instead?
Nick