Y, this is an impressive step forward. Thank you, Nick!

-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org]
Sent: Sunday, March 18, 2018 6:00 PM
To: dev@tika.apache.org
Subject: Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

Completely agree, awesome job Nick. I will definitely try this week as well.

Thank you!

Sincerely,
Chris

On 3/18/18, 2:47 PM, "David Meikle" <loo...@gmail.com> wrote:

Nice one Nick! Will take a look this week.

Cheers,
Dave

On 14 March 2018 at 17:38, Nick Burch <n...@apache.org> wrote:

> Hi All
>
> As promised, I've finally had a go to try and implement my ideas for
> TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
> breaking 2.x parser change
>
> My work so far is in this github branch, and is ready for review!
> https://github.com/apache/tika/tree/multiple-parsers
>
>
> It seems to work fine for the Fallback case, and for the Supplemental
> case. You can set a policy that controls how clashing metadata is handled,
> currently "first one to set a key wins", "last one to set a key wins",
> "ignore previous parsers", and "keep old and new unique values"
>
> I've also done a proof of concept for "pick best" case, to try running the
> text parser with a specified set of different charsets, capture the text
> from each, "pick the best" (hard coded 1st...) then run for real with that
> one.
>
>
> Key TODOs - Support InputStreamFactory, properly work out what mimetypes
> to claim to support, Tika Config XML friendly helper for the metadata clash
> policy, review ContentHandlerFactory signature and tweak if needed.
>
> Proposed breaking 2.x change - add second parse method that takes
> ContentHandlerFactory instead of ContentHandler, with most parsers getting
> that just grabbing a single one and using that as before
>
>
> Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
> I stop? Carry on? Modify it? Other?
>
> Nick
>
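
For anyone skimming the thread, here is a rough sketch of how the four metadata clash policies Nick lists might behave when a second parser's metadata is merged into what an earlier parser already set. This is not the code from the multiple-parsers branch: the class name `MetadataClashSketch`, the `Policy` enum and its constants, and the `merge` method are all hypothetical, and plain maps stand in for Tika's own metadata class. In particular, reading "ignore previous parsers" as "keep only the latest parser's values" is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these names come from Tika itself.
public class MetadataClashSketch {

    // One constant per policy named in the email (the names are assumptions).
    public enum Policy { FIRST_WINS, LAST_WINS, DISCARD_ALL, KEEP_ALL }

    // Merge the latest parser's metadata into what earlier parsers produced.
    public static Map<String, List<String>> merge(Policy policy,
                                                  Map<String, List<String>> previous,
                                                  Map<String, List<String>> latest) {
        if (policy == Policy.DISCARD_ALL) {
            // Assumed meaning of "ignore previous parsers": keep only the latest values.
            return copy(latest);
        }
        Map<String, List<String>> merged = copy(previous);
        for (Map.Entry<String, List<String>> entry : latest.entrySet()) {
            String key = entry.getKey();
            List<String> existing = merged.get(key);
            if (existing == null) {
                // No clash: the key is new, so every policy simply adds it.
                merged.put(key, new ArrayList<>(entry.getValue()));
                continue;
            }
            switch (policy) {
                case LAST_WINS:
                    // The latest parser overwrites the earlier values.
                    merged.put(key, new ArrayList<>(entry.getValue()));
                    break;
                case KEEP_ALL:
                    // Keep the old values and append any new ones not already present.
                    for (String value : entry.getValue()) {
                        if (!existing.contains(value)) {
                            existing.add(value);
                        }
                    }
                    break;
                default:
                    // FIRST_WINS: the earlier values stay untouched.
                    break;
            }
        }
        return merged;
    }

    // Copy the map and its value lists so that callers' data is never mutated.
    private static Map<String, List<String>> copy(Map<String, List<String>> source) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : source.entrySet()) {
            result.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> first = new LinkedHashMap<>();
        first.put("title", Arrays.asList("From parser one"));
        Map<String, List<String>> second = new LinkedHashMap<>();
        second.put("title", Arrays.asList("From parser two"));
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> " + merge(policy, first, second));
        }
    }
}
```

Only keys that both parsers set are treated as clashes here; a key set by just one parser is carried through under every policy except the "ignore previous parsers" one.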