Let's have a go at implementing it! You know my thoughts (make it like OODT ;) )
On 2/5/18, 8:37 AM, "Nick Burch" <[email protected]> wrote:

Ping - anyone got any thoughts on the proposed metadata parser stuff, and
any ideas on the content part?

On Tue, 2 Jan 2018, Nick Burch wrote:
> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> On collision, the precedence order defines what key takes precedence and
>> _overwrites_ the other. Overwrite is but one option (you could save *all*
>> the values it’s a multi-valued key structure so…)
>
> OK, I think that's fine. I've had a go at updating the wiki for the metadata
> case:
> https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
> And example Tika Config settings for it
> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
> If people are happy with how that sounds/looks, I can have a stab at
> implementing it, as I *think* it's quite easy
>
>
> However... that still leaves the Context (XHTML SAX events) case to solve!
>
> Anyone have any ideas on how we can append to or cancel/reset the Content
> Handler series of SAX events when we move onto a second+ parser for a file?
>
> Thanks
> Nick
>
>> On 10/26/17, 9:43 AM, "Nick Burch" <[email protected]> wrote:
>>
>> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> > My general approach to conflicting metadata is simply to define
>> > precedence orders.
>> >
>> > For example here is one documented from OODT:
>> >
>> > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>> >
>> > We can do similar things with Tika, e.g.,
>> >
>> > [CoreMetadata.PROPERTIES]
>> > [ImageParser.METADATA]
>> > [TikaOCR.METADATA]
>>
>> What happens if two different parsers both output the same bit of metadata
>> though? eg Tim's example of one giving dc:creator of Tim and the second
>> giving dc:creator of Chris?
>>
>>
>> Secondly, what about the XHTML sax events stream? I think that's probably
>> the harder case...
>>
>> Nick
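
For the metadata case discussed in the thread, a minimal sketch of precedence-ordered merging might look something like the following. It is plain Java and deliberately does not use any Tika classes; the class name MetadataPrecedenceSketch, the CollisionPolicy enum and the merge method are all hypothetical names made up for illustration. It also assumes that parser outputs are listed from lowest to highest precedence, which the thread does not specify. It covers the two collision options mentioned above (overwrite, or keep all values in the multi-valued key) and says nothing about the XHTML SAX events case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; none of these names exist in Tika. */
final class MetadataPrecedenceSketch {

    /** What to do when two parsers emit the same metadata key. */
    public enum CollisionPolicy { OVERWRITE, KEEP_ALL }

    /**
     * Merges per-parser metadata maps.
     * Assumption: the list runs from lowest to highest precedence,
     * so under OVERWRITE the last parser to emit a key wins.
     */
    public static Map<String, List<String>> merge(
            List<Map<String, List<String>>> lowestToHighest,
            CollisionPolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> parserOutput : lowestToHighest) {
            for (Map.Entry<String, List<String>> e : parserOutput.entrySet()) {
                if (policy == CollisionPolicy.OVERWRITE) {
                    // Higher-precedence values replace anything seen so far.
                    merged.put(e.getKey(), new ArrayList<>(e.getValue()));
                } else {
                    // Multi-valued key: append, keeping every parser's values.
                    merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                          .addAll(e.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Tim's example from the thread: two parsers disagree on dc:creator.
        Map<String, List<String>> first = new LinkedHashMap<>();
        first.put("dc:creator", Arrays.asList("Tim"));
        Map<String, List<String>> second = new LinkedHashMap<>();
        second.put("dc:creator", Arrays.asList("Chris"));

        // prints {dc:creator=[Chris]}
        System.out.println(merge(Arrays.asList(first, second), CollisionPolicy.OVERWRITE));
        // prints {dc:creator=[Tim, Chris]}
        System.out.println(merge(Arrays.asList(first, second), CollisionPolicy.KEEP_ALL));
    }
}
```

Making the policy an explicit parameter keeps "overwrite" and "save all the values" as a configuration choice rather than something baked into the merge.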
