At this point, I'm willing to punt to 3.x, unless there's momentum for either of these two. They would be great to have!
1) chaining multiple parsers -- additive

This shouldn't be too bad, except where there's conflicting metadata -- parser1 says author is 'bob', parser2 says author is 'alice'. If we added those values, we would break the uniqueness guarantees for Properties that should only allow a single value. Overwriting feels like a bad idea. Perhaps we remove the uniqueness guarantees when in "additive" mode ... or let users select additive/overwrite?

2) fallback parsers

>The biggest stumbling block, as I see it, is how to let multiple parsers
>interact with the SAX content handler. For the fallback case, that's how to
>say "sorry, ignore all that XML we already sent, we're starting again with
>this XML now".

Yes, this has been what's holding me back. How do we create a resettable handler without mucking too much with all of our current handlers?

For those with outputstreams/writers, I imagine we'd require a resettable OutputStream...TikaOutputStream(?)

TikaOutputStream() -- underlying StringWriter; on reset(), it would just be swapped for a new StringWriter ??? Not quite right...
TikaOutputStream.get(Path/File) -- would hold the underlying file/path, close the writer, and just rewrite on reset()
TikaOutputStream.get(ByteArrayOutputStream) -- baos has a reset(), so that should work...

What other use cases?

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Thursday, October 26, 2017 6:57 AM
To: [email protected]
Subject: Not-yet-broken breaking changes for Tika 2?

Hi All

Based on the plan on the wiki
<https://wiki.apache.org/tika/Tika2_0RoadMap>
<https://wiki.apache.org/tika/Tika2_0MigrationGuide>, we still have a major breaking change or two planned for Tika 2 that we haven't yet "broken". (There's also removing some deprecated stuff etc)

As I understand it, the biggest breaking TODO change is around having multiple parsers available + active for a given format.
This could be to support fallback parsers, eg "try this fancy new parser, but if it fails, retry with this simpler one" or "try this xml parser, if that fails just try strings". A related but different case is to cleanly support multiple parsers covering different aspects, eg OCR an image plus extract metadata, or NER on the contents of a scientific PDF + text + metadata + NER of the OCR of embedded images in the PDF.

Currently, we can't cleanly do the former, and the latter is (badly) handled via one parser (eg OCR or NER) having an embedded hard-coded reference to another (eg Image or PDF).

We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion

The biggest stumbling block, as I see it, is how to let multiple parsers interact with the SAX content handler. For the fallback case, that's how to say "sorry, ignore all that XML we already sent, we're starting again with this XML now". For the multiple parser case, it's how we could have the image parser "finish" the (empty) XHTML but then have the OCR one send some text, or have the NER parser get at the XHTML text of the PDF + OCR of embedded images to enhance with the entities.

What do we think for this? Can we come up with a solution to let this go forward? Is there a pattern from elsewhere we can follow? Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and do this stuff in Tika 3 instead?

Nick
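
To make the "resettable handler" idea for the fallback case a bit more concrete, here is one possible shape for it: a handler that records SAX events instead of forwarding them, so a failed attempt can be dropped before anything reaches the real handler. This is only a sketch under assumptions. BufferingResettableHandler, ResettableHandlerDemo, fakeParse, reset() and commit() are all made-up names, nothing here exists in Tika, and only a handful of ContentHandler callbacks are recorded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class (not part of Tika): records SAX events instead of
// forwarding them, so a failed parse can be thrown away with reset().
// Only a subset of ContentHandler callbacks is recorded in this sketch.
class BufferingResettableHandler extends DefaultHandler {

    // One recorded event that can be replayed against the real handler.
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add(h -> h.startDocument());
    }

    @Override
    public void endDocument() {
        events.add(h -> h.endDocument());
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Copy the attributes: SAX producers are allowed to reuse the object.
        Attributes copy = new AttributesImpl(atts);
        events.add(h -> h.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add(h -> h.endElement(uri, local, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the text: the char[] buffer is only valid during this call.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(h -> h.characters(copy, 0, copy.length));
    }

    // "Sorry, ignore all that XML we already sent."
    void reset() {
        events.clear();
    }

    // Replay the surviving events into the real handler, then clear the buffer.
    void commit(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
        events.clear();
    }
}

class ResettableHandlerDemo {
    // Stand-in for a parser: emits a little XHTML-like output, optionally failing midway.
    static void fakeParse(ContentHandler h, String text, boolean fail) throws SAXException {
        h.startDocument();
        h.startElement("", "p", "p", new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        if (fail) {
            throw new SAXException("primary parser failed");
        }
        h.endElement("", "p", "p");
        h.endDocument();
    }

    static String run() {
        BufferingResettableHandler buffer = new BufferingResettableHandler();
        StringBuilder out = new StringBuilder();
        try {
            try {
                fakeParse(buffer, "partial output", true);
            } catch (SAXException e) {
                buffer.reset(); // drop what the failed parser already sent
                fakeParse(buffer, "fallback output", false);
            }
            // Only now does the "real" handler see anything.
            buffer.commit(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    out.append(ch, start, length);
                }
            });
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "fallback output"
    }
}
```

The obvious cost is that nothing streams any more: the whole event sequence sits in memory until commit(), which may not be acceptable for large documents. The upside is that the existing handlers would not need to change, since they only ever see the events of the attempt that succeeded.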

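
For the additive case, the "let users select additive/overwrite" option could look roughly like the following. Again this is a sketch, not a proposal for the actual API: MergePolicy, MetadataMerger and MetadataMergeDemo are hypothetical names, and plain maps stand in for Tika's Metadata object so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; plain maps stand in for Tika's Metadata.
enum MergePolicy { ADDITIVE, OVERWRITE }

class MetadataMerger {
    // Merges the second parser's metadata into a copy of the first parser's.
    static Map<String, List<String>> merge(Map<String, List<String>> first,
                                           Map<String, List<String>> second,
                                           MergePolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        first.forEach((key, values) -> merged.put(key, new ArrayList<>(values)));
        second.forEach((key, values) -> {
            List<String> existing = merged.get(key);
            if (existing == null || policy == MergePolicy.OVERWRITE) {
                // New key, or the later parser wins outright.
                merged.put(key, new ArrayList<>(values));
            } else {
                // ADDITIVE: keep both values, which relaxes single-value guarantees.
                for (String v : values) {
                    if (!existing.contains(v)) {
                        existing.add(v);
                    }
                }
            }
        });
        return merged;
    }
}

class MetadataMergeDemo {
    public static void main(String[] args) {
        Map<String, List<String>> parser1 = Map.of("author", List.of("bob"));
        Map<String, List<String>> parser2 = Map.of("author", List.of("alice"));
        // prints {author=[bob, alice]}
        System.out.println(MetadataMerger.merge(parser1, parser2, MergePolicy.ADDITIVE));
        // prints {author=[alice]}
        System.out.println(MetadataMerger.merge(parser1, parser2, MergePolicy.OVERWRITE));
    }
}
```

In ADDITIVE mode a key that is meant to be single-valued ends up with two values, which is exactly the uniqueness problem described above; the sketch just makes that an explicit choice rather than resolving it.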