On the metadata stuff, I'm coming around to Ray Gauss's proposal. I wanted too much back then, and his solution is super elegant, IIRC.

-----Original Message-----
From: Nick Burch [mailto:[email protected]]
Sent: Monday, February 5, 2018 11:37 AM
To: [email protected]
Subject: Re: Not-yet-broken breaking changes for Tika 2?

Ping - anyone got any thoughts on the proposed metadata parser stuff, and
any ideas on the content part?

On Tue, 2 Jan 2018, Nick Burch wrote:
> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> On collision, the precedence order defines what key takes precedence
>> and _overwrites_ the other. Overwrite is but one option (you could
>> save *all* the values it’s a multi-valued key structure so…)
>
> OK, I think that's fine. I've had a go at updating the wiki for the
> metadata case:
> https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
> And example Tika Config settings for it
> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
> If people are happy with how that sounds/looks, I can have a stab at
> implementing it, as I *think* it's quite easy
>
>
> However... that still leaves the Context (XHTML SAX events) case to solve!
>
> Anyone have any ideas on how we can append to or cancel/reset the
> Content Handler series of SAX events when we move onto a second+ parser
> for a file?
>
> Thanks
> Nick
>
>> On 10/26/17, 9:43 AM, "Nick Burch" <[email protected]> wrote:
>>
>> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> > My general approach to conflicting metadata is simply to define
>> > precedence orders.
>> >
>> > For example here is one documented from OODT:
>> >
>> > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>> >
>> > We can do similar things with Tika, e.g.,
>> >
>> > [CoreMetadata.PROPERTIES]
>> > [ImageParser.METADATA]
>> > [TikaOCR.METADATA]
>>
>> What happens if two different parsers both output the same bit of
>> metadata though? eg Tim's example of one giving dc:creator of Tim and
>> the second giving dc:creator of Chris?
>>
>>
>> Secondly, what about the XHTML sax events stream? I think that's
>> probably the harder case...
>>
>> Nick
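
For anyone skimming the thread, here is a rough sketch of the two collision options discussed above (the highest-precedence parser overwrites, versus keeping all values in a multi-valued key). It is plain Python and does not use Tika's API; merge_metadata, ParserA, and ParserB are hypothetical names made up for illustration, and I'm assuming the first entry in the precedence list is the one that wins, which the thread doesn't actually specify.

```python
def merge_metadata(parser_outputs, precedence, keep_all=False):
    """Merge per-parser metadata using a precedence order.

    parser_outputs: dict of parser name -> {key: [values]} (hypothetical shape)
    precedence:     parser names, highest precedence first (an assumption)
    keep_all:       False -> the highest-precedence parser's values win
                    True  -> values from all parsers are kept, in precedence order
    """
    merged = {}
    for parser_name in precedence:
        for key, values in parser_outputs.get(parser_name, {}).items():
            if key not in merged:
                # First (highest-precedence) parser to supply this key.
                merged[key] = list(values)
            elif keep_all:
                # Multi-valued option: append values we have not seen yet.
                for value in values:
                    if value not in merged[key]:
                        merged[key].append(value)
            # Otherwise a higher-precedence parser already set the key,
            # so the lower-precedence values are dropped.
    return merged


# The dc:creator collision from the thread, with made-up parser names.
outputs = {
    "ParserA": {"dc:creator": ["Tim"]},
    "ParserB": {"dc:creator": ["Chris"], "dc:title": ["Report"]},
}
order = ["ParserA", "ParserB"]

overwrite_result = merge_metadata(outputs, order)
keep_all_result = merge_metadata(outputs, order, keep_all=True)
```

With overwrite semantics dc:creator ends up as just Tim; with keep_all it holds both Tim and Chris, in precedence order. Keys that only one parser supplies (dc:title here) pass through unchanged either way.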
