Yes, this is an impressive step forward.  Thank you, Nick!

-----Original Message-----
From: Chris Mattmann [mailto:mattm...@apache.org] 
Sent: Sunday, March 18, 2018 6:00 PM
To: dev@tika.apache.org
Subject: Re: TIKA-1509 (2.x breaking parser change) - ready for first review!

Completely agree, awesome job Nick.

I will definitely try this week as well.

Thank you!

Sincerely,
Chris



On 3/18/18, 2:47 PM, "David Meikle" <loo...@gmail.com> wrote:

    Nice one Nick!  Will take a look this week.
    
    Cheers,
    Dave
    
    On 14 March 2018 at 17:38, Nick Burch <n...@apache.org> wrote:
    
    > Hi All
    >
    > As promised, I've finally had a go to try and implement my ideas for
    > TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
    > breaking 2.x parser change
    >
    > My work so far is in this github branch, and is ready for review!
    > https://github.com/apache/tika/tree/multiple-parsers
    >
    >
    > It seems to work fine for the Fallback case, and for the Supplemental
    > case. You can set a policy that controls how clashing metadata is handled,
    > currently "first one to set a key wins", "last one to set a key wins",
    > "ignore previous parsers", and "keep old and new unique values"
    >
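For anyone reading along, the four clash policies listed above could behave roughly as in the sketch below. This is an illustration only, not code from the multiple-parsers branch: ClashPolicy, MetadataMerger and merge are hypothetical names, metadata is modelled as a plain map of key to values rather than Tika's Metadata class, and reading "ignore previous parsers" as "drop whatever earlier parsers set" is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is not the API on the branch.
enum ClashPolicy {
    FIRST_WINS,   // first parser to set a key wins
    LAST_WINS,    // last parser to set a key wins
    DISCARD_ALL,  // ignore previous parsers (assumed: drop their metadata)
    KEEP_ALL      // keep old and new unique values
}

final class MetadataMerger {
    private MetadataMerger() {}

    /** Merges the latest parser's metadata into what earlier parsers produced. */
    static Map<String, List<String>> merge(ClashPolicy policy,
                                           Map<String, List<String>> previous,
                                           Map<String, List<String>> latest) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        if (policy != ClashPolicy.DISCARD_ALL) {
            // Start from a copy of the earlier metadata.
            previous.forEach((k, v) -> result.put(k, new ArrayList<>(v)));
        }
        for (Map.Entry<String, List<String>> e : latest.entrySet()) {
            List<String> existing = result.get(e.getKey());
            if (existing == null) {
                // No clash: the key is new, so every policy simply adds it.
                result.put(e.getKey(), new ArrayList<>(e.getValue()));
                continue;
            }
            switch (policy) {
                case FIRST_WINS:
                    break; // keep the earlier values untouched
                case LAST_WINS:
                    result.put(e.getKey(), new ArrayList<>(e.getValue()));
                    break;
                case KEEP_ALL:
                    for (String v : e.getValue()) {
                        if (!existing.contains(v)) {
                            existing.add(v); // append only values not seen before
                        }
                    }
                    break;
                default:
                    break; // DISCARD_ALL starts empty, so no clash can occur
            }
        }
        return result;
    }
}
```

Calling merge once per parser, feeding each result back in as "previous", would give the running metadata for a chain of parsers.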
    > I've also done a proof of concept for "pick best" case, to try running the
    > text parser with a specified set of different charsets, capture the text
    > from each, "pick the best" (hard coded 1st...) then run for real with that
    > one.
    >
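The "pick best" idea above (decode with each candidate charset, score the results, then parse for real with the winner) might look something like this. It is a sketch under stated assumptions, not the proof of concept from the branch, which hard codes the first candidate: CharsetPicker and pickBest are hypothetical names, and scoring by counting U+FFFD replacement characters is an assumed heuristic swapped in for illustration.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper; the scoring heuristic is an assumption, not Tika's.
final class CharsetPicker {
    private CharsetPicker() {}

    /** Counts U+FFFD characters, which String decoding substitutes for malformed input. */
    static long replacementCount(String text) {
        return text.chars().filter(c -> c == 0xFFFD).count();
    }

    /**
     * Decodes the bytes with every candidate and returns the charset whose
     * output has the fewest replacement characters. Ties go to the earlier
     * candidate, so the order of the list expresses a preference.
     */
    static Charset pickBest(byte[] data, List<Charset> candidates) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate charsets");
        }
        Charset best = null;
        long bestScore = Long.MAX_VALUE;
        for (Charset candidate : candidates) {
            long score = replacementCount(new String(data, candidate));
            if (score < bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }
}
```

Single-byte charsets such as ISO-8859-1 accept every byte, so they never score worse than zero here; a real scorer would need something smarter than counting decode failures, which is why the candidate order matters in this sketch.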
    >
    > Key TODOs - Support InputStreamFactory, properly work out what mimetypes
    > to claim to support, Tika Config XML friendly helper for the metadata
    > clash policy, review ContentHandlerFactory signature and tweak if needed.
    >
    > Proposed breaking 2.x change - add a second parse method that takes a
    > ContentHandlerFactory instead of a ContentHandler, with most parsers that
    > receive it just grabbing a single handler and using that as before
    >
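To make the shape of the proposed second parse method concrete, here is a minimal sketch. It assumes simplified stand-in types: HandlerFactory, SketchParser and PlainTextSketchParser are hypothetical and are not Tika's Parser or ContentHandlerFactory, and the real signatures (which also carry Metadata and ParseContext) may well differ. Only the org.xml.sax types are real JDK classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical stand-in for a factory that hands out fresh handlers.
interface HandlerFactory {
    ContentHandler newHandler();
}

// Hypothetical, simplified parser interface for illustration only.
interface SketchParser {
    // Existing style: the caller supplies a single handler.
    void parse(InputStream stream, ContentHandler handler)
            throws IOException, SAXException;

    // Proposed style: the caller supplies a factory. By default a parser
    // grabs one handler and carries on exactly as before.
    default void parse(InputStream stream, HandlerFactory factory)
            throws IOException, SAXException {
        parse(stream, factory.newHandler());
    }
}

// An ordinary parser only implements the single-handler method.
final class PlainTextSketchParser implements SketchParser {
    @Override
    public void parse(InputStream stream, ContentHandler handler)
            throws IOException, SAXException {
        char[] text = new String(stream.readAllBytes(), StandardCharsets.UTF_8)
                .toCharArray();
        handler.startDocument();
        handler.characters(text, 0, text.length);
        handler.endDocument();
    }
}
```

With a default like this, only a composite parser that wants one handler per child parser would need to override the factory variant; whether the real change uses a default method or an abstract base class is left open here.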
    >
    > Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
    > I stop? Carry on? Modify it? Other?
    >
    > Nick
    >
    


