Nick,
It looks like you merged to master, which, I think, is the base for
2.0.0-SNAPSHOT. I've been treating branch_1x as the master for 1.x.[1]
Any objections to me cutting 1.18-SNAPSHOT from branch_1x?
Best,
Tim
[1]
https://lists.apache.org/thread.html/12342a115623d157063eb9f40064ccf21561cdab5cfb327f3f368aca@%3Cdev.tika.apache.org%3E
-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Sunday, April 8, 2018 8:47 AM
To: [email protected]
Subject: Re: TIKA-1509 (2.x breaking parser change) - ready for first review!
In the absence of complaints, I've gone ahead and merged this to Tika's master
branch for 1.x. If I've done it right, there won't be any breaking changes for
1.18, as everything is either new or marked as deprecated pending finalisation.
I haven't merged to 2.x yet, as it'd be good to get some feedback on the
proposed overloaded Parser parse method taking a ContentHandlerFactory
(to go alongside the long-standing ContentHandler one for simpler cases).
Nick
On Sun, 18 Mar 2018, Chris Mattmann wrote:
> Completely agree, awesome job Nick.
>
> I will definitely try this week as well.
>
> Thank you!
>
> Sincerely,
> Chris
>
>
>
> On 3/18/18, 2:47 PM, "David Meikle" <[email protected]> wrote:
>
> Nice one Nick! Will take a look this week.
>
> Cheers,
> Dave
>
> On 14 March 2018 at 17:38, Nick Burch <[email protected]> wrote:
>
> > Hi All
> >
> > As promised, I've finally had a go to try and implement my ideas for
> > TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion /
> > breaking 2.x parser change
> >
> > My work so far is in this github branch, and is ready for review!
> > https://github.com/apache/tika/tree/multiple-parsers
> >
> >
> > It seems to work fine for the Fallback case, and for the Supplemental
> > case. You can set a policy that controls how clashing metadata is
> > handled, currently "first one to set a key wins", "last one to set a
> > key wins", "ignore previous parsers", and "keep old and new unique
> > values".
> >
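For anyone skimming the thread, the four clash policies listed above can be sketched roughly as below. This is only an illustration in Python, not the code on the multiple-parsers branch: the names ClashPolicy and merge are hypothetical, metadata is modelled as a plain dict of key to list of values, and reading "ignore previous parsers" as "keep only the newest parser's metadata" is an assumption.

```python
from enum import Enum


class ClashPolicy(Enum):
    # Hypothetical names; the values quote the policy descriptions above.
    FIRST_WINS = "first one to set a key wins"
    LAST_WINS = "last one to set a key wins"
    DISCARD_ALL = "ignore previous parsers"
    KEEP_ALL = "keep old and new unique values"


def merge(old, new, policy):
    """Return a new dict combining metadata from an earlier parser (old)
    with metadata from a later parser (new). Inputs are not modified."""
    if policy is ClashPolicy.DISCARD_ALL:
        # Assumed meaning: only the latest parser's metadata survives.
        return {key: list(values) for key, values in new.items()}

    merged = {key: list(values) for key, values in old.items()}
    for key, values in new.items():
        if key not in merged:
            # No clash: every remaining policy just takes the new key.
            merged[key] = list(values)
        elif policy is ClashPolicy.LAST_WINS:
            merged[key] = list(values)
        elif policy is ClashPolicy.KEEP_ALL:
            for value in values:
                if value not in merged[key]:
                    merged[key].append(value)
        # FIRST_WINS: the existing values stay untouched.
    return merged
```

With old = {"title": ["A"]} and new = {"title": ["B", "A"]}, FIRST_WINS keeps ["A"], LAST_WINS gives ["B", "A"], and KEEP_ALL gives ["A", "B"].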
> > I've also done a proof of concept for "pick best" case, to try running
> > the text parser with a specified set of different charsets, capture the
> > text from each, "pick the best" (hard coded 1st...) then run for real
> > with that one.
> >
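The "pick best" flow described above (trial-decode with each charset, choose one, then run for real) might look something like the sketch below. It is a guess at the shape, not the proof of concept itself: pick_best_charset is a hypothetical name, and scoring by the number of U+FFFD replacement characters is an assumed stand-in for whatever real heuristic ends up being used. Ties go to the earliest candidate, which matches the current "hard coded 1st" behaviour when nothing distinguishes them.

```python
def pick_best_charset(data, candidates):
    """Trial-decode data with every candidate charset and return the
    charset whose output contains the fewest replacement characters."""
    trial_text = {
        charset: data.decode(charset, errors="replace")
        for charset in candidates
    }

    def score(charset):
        # Assumed heuristic: fewer U+FFFD characters means a better fit.
        return trial_text[charset].count("\ufffd")

    # min() returns the first candidate among equal scores.
    return min(candidates, key=score)


def parse_text(data, candidates):
    """Pick a charset from the trial runs, then decode for real with it."""
    charset = pick_best_charset(data, candidates)
    return charset, data.decode(charset, errors="replace")
```

For example, parse_text("café".encode("utf-8"), ["ascii", "utf-8"]) selects utf-8, because the ASCII trial run has to replace the two bytes of the accented character.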
> >
> > Key TODOs - Support InputStreamFactory, properly work out what mimetypes
> > to claim to support, Tika Config XML friendly helper for the metadata
> > clash policy, review ContentHandlerFactory signature and tweak if needed.
> >
> > Proposed breaking 2.x change - add second parse method that takes
> > ContentHandlerFactory instead of ContentHandler, with most parsers
> > getting that just grabbing a single one and using that as before
> >
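To make the proposed second parse method concrete, here is a rough sketch of the delegation idea: the factory-taking variant grabs a single handler and falls through to the existing handler-taking method. It is written in Python purely for brevity and does not reflect Tika's actual Java signatures; Parser, parse_with_factory and LineParser are hypothetical names, and the "handler" is just a list that collects lines.

```python
import io


class Parser:
    def parse(self, stream, handler, metadata):
        """Long-standing style: the caller supplies one handler."""
        raise NotImplementedError

    def parse_with_factory(self, stream, handler_factory, metadata):
        """Proposed style: the caller supplies a factory. The default,
        which most parsers would inherit, grabs a single handler and
        carries on exactly as before."""
        handler = handler_factory()
        self.parse(stream, handler, metadata)
        return handler


class LineParser(Parser):
    """Toy parser: sends each line of the stream to the handler."""

    def parse(self, stream, handler, metadata):
        count = 0
        for line in stream:
            handler.append(line.rstrip("\n"))
            count += 1
        metadata["lines"] = count


metadata = {}
handler = LineParser().parse_with_factory(io.StringIO("a\nb\n"), list, metadata)
```

Only a parser that needs several handlers (the multiple-parser cases in this thread) would override the factory variant to call the factory more than once.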
> >
> > Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
> > I stop? Carry on? Modify it? Other?
> >
> > Nick
> >
>
>
>
>