Hi All
As promised, I've finally had a go at implementing my ideas for
TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
the breaking 2.x parser change.
My work so far is in this GitHub branch, and is ready for review!
https://github.com/apache/tika/tree/multiple-parsers
It seems to work fine for the Fallback case, and for the Supplemental
case. You can set a policy that controls how clashing metadata is handled;
the current options are "first one to set a key wins", "last one to set a
key wins", "ignore previous parsers", and "keep old and new unique values".
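To make the four policies concrete, here is a minimal sketch of how they could behave when merging one parser's metadata into what earlier parsers produced. The names (ClashPolicy, MetadataMerger) and the plain Map representation are made up for illustration; this is not the code on the branch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names, for illustration only.
enum ClashPolicy { FIRST_WINS, LAST_WINS, DISCARD_ALL, KEEP_ALL }

final class MetadataMerger {
    private MetadataMerger() {}

    /** Merges the latest parser's metadata into the metadata from earlier parsers. */
    static Map<String, List<String>> merge(ClashPolicy policy,
                                           Map<String, List<String>> previous,
                                           Map<String, List<String>> latest) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        if (policy == ClashPolicy.DISCARD_ALL) {
            // "ignore previous parsers": only the latest parser's values survive
            latest.forEach((k, v) -> out.put(k, new ArrayList<>(v)));
            return out;
        }
        previous.forEach((k, v) -> out.put(k, new ArrayList<>(v)));
        for (Map.Entry<String, List<String>> e : latest.entrySet()) {
            List<String> existing = out.get(e.getKey());
            if (existing == null) {
                // No clash: the key is new, so every policy keeps it
                out.put(e.getKey(), new ArrayList<>(e.getValue()));
                continue;
            }
            switch (policy) {
                case FIRST_WINS:
                    break; // keep what the earlier parser set
                case LAST_WINS:
                    out.put(e.getKey(), new ArrayList<>(e.getValue()));
                    break;
                case KEEP_ALL:
                    // keep old values, append new ones not already present
                    for (String v : e.getValue()) {
                        if (!existing.contains(v)) {
                            existing.add(v);
                        }
                    }
                    break;
                default:
                    throw new IllegalStateException("Unhandled policy: " + policy);
            }
        }
        return out;
    }
}
```

With "title" set to A by the first parser and to B by the second, FIRST_WINS keeps A, LAST_WINS keeps B, and KEEP_ALL keeps both A and B.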
I've also done a proof of concept for the "pick best" case: it runs the
text parser with each of a specified set of charsets, captures the text
from each, "picks the best" (currently hard coded to the 1st...), then runs
for real with that one.
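Roughly, the "pick best" flow looks like the sketch below. It decodes raw bytes directly rather than going through the text parser, and PickBestCharset, its methods, and the replacement-character scorer are all hypothetical names invented for the example. Passing a scorer that always returns 0 gives the current "first one wins" behaviour.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical helper, for illustration only.
final class PickBestCharset {
    private PickBestCharset() {}

    /** Step 1: decode the same bytes once per candidate charset and keep each result. */
    static Map<Charset, String> decodeAll(byte[] data, List<Charset> candidates) {
        Map<Charset, String> texts = new LinkedHashMap<>();
        for (Charset cs : candidates) {
            texts.put(cs, new String(data, cs));
        }
        return texts;
    }

    /** Step 2: pick the highest scoring candidate; on a tie the earliest candidate wins. */
    static Charset pickBest(Map<Charset, String> texts, ToIntFunction<String> score) {
        Charset best = null;
        int bestScore = Integer.MIN_VALUE;
        for (Map.Entry<Charset, String> e : texts.entrySet()) {
            int s = score.applyAsInt(e.getValue());
            if (best == null || s > bestScore) {
                best = e.getKey();
                bestScore = s;
            }
        }
        return best;
    }

    /** One possible scorer (an assumption): fewer U+FFFD replacement chars is better. */
    static int penaliseReplacementChars(String text) {
        int bad = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\uFFFD') {
                bad++;
            }
        }
        return -bad;
    }
}
```

Step 3, running "for real", would then be a normal parse using the charset that pickBest returned.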
Key TODOs - support InputStreamFactory, properly work out which mimetypes
to claim to support, add a Tika Config XML friendly helper for the metadata
clash policy, and review the ContentHandlerFactory signature and tweak it
if needed.
Proposed breaking 2.x change - add a second parse method that takes a
ContentHandlerFactory instead of a ContentHandler, with most parsers
just grabbing a single handler from the factory and using it as before.
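As a rough illustration of the shape of that change, here is a cut-down sketch. SketchParser and HandlerFactory are hypothetical stand-ins for Parser and ContentHandlerFactory, and the Metadata and ParseContext arguments are left out to keep it short; the real signatures may well end up different.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for ContentHandlerFactory.
interface HandlerFactory {
    ContentHandler getNewContentHandler();
}

// Hypothetical stand-in for Parser, with Metadata / ParseContext omitted.
interface SketchParser {
    /** The existing style of parse method. */
    void parse(InputStream stream, ContentHandler handler) throws IOException, SAXException;

    /** The proposed second method: by default grab one handler and carry on as before. */
    default void parse(InputStream stream, HandlerFactory factory) throws IOException, SAXException {
        parse(stream, factory.getNewContentHandler());
    }
}

// An ordinary parser only implements the existing method.
final class HelloParser implements SketchParser {
    @Override
    public void parse(InputStream stream, ContentHandler handler) throws IOException, SAXException {
        char[] text = new String(stream.readAllBytes(), StandardCharsets.UTF_8).toCharArray();
        handler.startDocument();
        handler.characters(text, 0, text.length);
        handler.endDocument();
    }
}

// A factory that records how many handlers were requested and what text they received.
final class CountingFactory implements HandlerFactory {
    int created = 0;
    final StringBuilder text = new StringBuilder();

    @Override
    public ContentHandler getNewContentHandler() {
        created++;
        return new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
    }
}
```

A parser that runs several child parsers would override the factory version and ask for a fresh handler per child, which is the reason for passing a factory rather than a single handler.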
Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
I stop? Carry on? Modify it? Other?
Nick