At this point, I'm willing to punt to 3.x, unless there's momentum for either 
of these two.  They would be great to have!

1) chaining multiple parsers -- additive
This shouldn't be too bad, except where there's conflicting metadata -- parser1 
says author is 'bob', parser2 says author is 'alice'.  We would break some 
uniqueness guarantees for some Properties that should only allow a single value 
if we added those values...  Overwriting feels like a bad idea.  Perhaps we 
remove the uniqueness guarantees when in "additive" mode ... or let users 
select additive/overwrite?
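
A minimal sketch of what a user-selectable policy could look like, using plain JDK maps rather than Tika's Metadata class. MergePolicy, MetadataMergeSketch and KEEP_FIRST are hypothetical names for illustration only; nothing like this exists in Tika today.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of Tika's API.
class MetadataMergeSketch {

    // ADDITIVE keeps values from both parsers, OVERWRITE lets the later
    // parser win, KEEP_FIRST lets the earlier parser win.
    enum MergePolicy { ADDITIVE, OVERWRITE, KEEP_FIRST }

    static Map<String, List<String>> merge(Map<String, List<String>> first,
                                           Map<String, List<String>> second,
                                           MergePolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        // Start from a defensive copy of the first parser's metadata.
        for (Map.Entry<String, List<String>> e : first.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        for (Map.Entry<String, List<String>> e : second.entrySet()) {
            List<String> existing = merged.get(e.getKey());
            if (existing == null || policy == MergePolicy.OVERWRITE) {
                // No conflict, or the later parser is allowed to replace.
                merged.put(e.getKey(), new ArrayList<>(e.getValue()));
            } else if (policy == MergePolicy.ADDITIVE) {
                // Keep both, skipping exact duplicates. This is where a
                // single-valued property would lose its uniqueness guarantee.
                for (String v : e.getValue()) {
                    if (!existing.contains(v)) {
                        existing.add(v);
                    }
                }
            }
            // KEEP_FIRST: leave the existing values untouched.
        }
        return merged;
    }
}
```

With parser1 reporting author 'bob' and parser2 reporting author 'alice', ADDITIVE would yield both values in parser order, OVERWRITE would yield only 'alice', and KEEP_FIRST would yield only 'bob'.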

2) fallback parsers 
>The biggest stumbling block, as I see it, is how to let multiple parsers 
>interact with the SAX content handler. For the fallback case, that's how to 
>say "sorry, ignore all that XML we already sent, we're starting again with 
>this XML now".

Yes, this is what has been holding me back.  How do we create a resettable 
handler without mucking too much with all of our current handlers?  For 
those with OutputStreams/Writers, I imagine we'd require a resettable 
OutputStream...TikaOutputStream(?)
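
One option that might avoid touching the existing handlers at all is a wrapper that buffers SAX events and only forwards them once a parser has succeeded. The sketch below is an assumption about how that could look, not a worked-out design: BufferingHandler, reset() and commit() are hypothetical names, only three event types are recorded, and everything is held in memory, which may not be acceptable for large documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: records SAX events instead of passing
// them straight through, so a failed parse can be discarded.
class BufferingHandler extends DefaultHandler {

    // One recorded SAX event that can be replayed against a real handler.
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Copy the attributes: callers may reuse the Attributes object.
        Attributes copy = new AttributesImpl(atts);
        events.add(h -> h.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(h -> h.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the characters: the caller's buffer is only valid during the call.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(h -> h.characters(copy, 0, copy.length));
    }

    // "Sorry, ignore all that XML we already sent."
    void reset() {
        events.clear();
    }

    // The parser succeeded: forward everything to the real handler.
    void commit(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
        events.clear();
    }
}
```

A fallback wrapper would call reset() when the first parser throws, run the second parser against the same buffer, and call commit() only at the end. The cost is that nothing streams to the real handler until the parse has finished.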

TikaOutputStream() -- underlying StringWriter; reset() would just swap in a new 
StringWriter ??? Not quite right...
TikaOutputStream.get(Path/File) -- would hold the underlying file/path, close 
the writer, and just rewrite the file on reset()
TikaOutputStream.get(ByteArrayOutputStream) -- baos has a reset(), so that should 
work...
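
A rough sketch of the in-memory and file-backed variants, using only JDK classes. ResettableOutputStream, InMemoryResettable and FileResettable are hypothetical names standing in for the "TikaOutputStream" idea above; none of this is existing Tika API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only: an OutputStream that can discard
// everything written so far.
abstract class ResettableOutputStream extends OutputStream {
    abstract void reset() throws IOException;
}

// In-memory variant: ByteArrayOutputStream already has reset().
final class InMemoryResettable extends ResettableOutputStream {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    @Override public void write(int b) { buffer.write(b); }
    @Override public void write(byte[] b, int off, int len) { buffer.write(b, off, len); }
    @Override void reset() { buffer.reset(); }

    byte[] toByteArray() { return buffer.toByteArray(); }
}

// File variant: remember the path, close the stream, reopen truncated.
final class FileResettable extends ResettableOutputStream {
    private final Path path;
    private OutputStream out;

    FileResettable(Path path) throws IOException {
        this.path = path;
        this.out = open();
    }

    private OutputStream open() throws IOException {
        // TRUNCATE_EXISTING discards whatever an earlier attempt wrote.
        return Files.newOutputStream(path, StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE);
    }

    @Override public void write(int b) throws IOException { out.write(b); }
    @Override public void write(byte[] b, int off, int len) throws IOException { out.write(b, off, len); }
    @Override public void flush() throws IOException { out.flush(); }
    @Override public void close() throws IOException { out.close(); }

    @Override void reset() throws IOException {
        out.close();
        out = open();
    }
}
```

This only covers sinks we open ourselves. A caller-supplied OutputStream that has already been written to (a socket, an HTTP response) cannot be rewound this way, which is probably the hard case.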

What other use cases?




-----Original Message-----
From: Nick Burch [mailto:[email protected]] 
Sent: Thursday, October 26, 2017 6:57 AM
To: [email protected]
Subject: Not-yet-broken breaking changes for Tika 2?

Hi All

Based on the plan on the wiki
<https://wiki.apache.org/tika/Tika2_0RoadMap>
<https://wiki.apache.org/tika/Tika2_0MigrationGuide>, we still have a major 
breaking change or two planned for Tika 2 that we haven't yet "broken". 
(There's also removing some deprecated stuff etc)


As I understand it, the biggest breaking TODO change is around having multiple 
parsers available + active for a given format. This could be to support 
fallback parsers, eg "try this fancy new parser, but if it fails, retry with 
this simpler one" or "try this xml parser, if that fails just try strings". A 
related but different case is to cleanly support multiple parsers covering 
different aspects, eg OCR an image plus extract metadata, or NER on the 
contents of a scientific PDF + text + metadata + NER of the OCR of embedded 
images in the PDF.

Currently, we can't cleanly do the former, and the latter is (badly) handled 
via one parser (eg OCR or NER) having an embedded hard-coded reference to 
another (eg Image or PDF).


We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion

The biggest stumbling block, as I see it, is how to let multiple parsers 
interact with the SAX content handler. For the fallback case, that's how to say 
"sorry, ignore all that XML we already sent, we're starting again with this XML 
now". For the multiple parser case, it's how we could have the image parser 
"finish" the (empty) XHTML but then have the OCR one send some text, or have 
the NER parser get at the XHTML text of the PDF + OCR of embedded images to 
enhance with the entities.


What do we think for this? Can we come up with a solution to let this go 
forward? Is there a pattern from elsewhere we can follow?

Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and do 
this stuff in Tika 3 instead?

Nick
