Ping - anyone got any thoughts on the proposed metadata parser stuff, and any ideas on the content part?

On Tue, 2 Jan 2018, Nick Burch wrote:
On Thu, 26 Oct 2017, Chris Mattmann wrote:
On collision, the precedence order defines which key takes precedence and _overwrites_ the other. Overwrite is but one option (you could save *all* the values; it’s a multi-valued key structure, so…)

OK, I think that's fine. I've had a go at updating the wiki for the metadata case:
https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
And example Tika Config settings for it
https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
If people are happy with how that sounds/looks, I can have a stab at implementing it, as I *think* it's quite easy
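
For anyone skimming, here is a rough sketch of the collision options discussed above (overwrite by precedence vs. keeping all values), using Tim's dc:creator example. It is plain Java over maps, not the Tika Metadata API; the class name MetadataMergeSketch, the Policy enum and the merge method are all made up for illustration, and the real config on the wiki may well look different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataMergeSketch {

    /** Hypothetical collision policies; these names are illustrative, not Tika API. */
    enum Policy { FIRST_WINS, LAST_WINS, KEEP_ALL }

    /**
     * Merges per-parser metadata. The list order is the precedence order:
     * index 0 is the first parser that ran, and so on.
     */
    static Map<String, List<String>> merge(
            List<Map<String, List<String>>> perParser, Policy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> parserOutput : perParser) {
            for (Map.Entry<String, List<String>> e : parserOutput.entrySet()) {
                String key = e.getKey();
                if (!merged.containsKey(key)) {
                    // No collision: always take the values.
                    merged.put(key, new ArrayList<>(e.getValue()));
                } else if (policy == Policy.LAST_WINS) {
                    // Collision: the later parser overwrites the earlier one.
                    merged.put(key, new ArrayList<>(e.getValue()));
                } else if (policy == Policy.KEEP_ALL) {
                    // Collision: keys are multi-valued, so keep everything.
                    merged.get(key).addAll(e.getValue());
                }
                // FIRST_WINS on collision: keep what is already there.
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> first =
                Collections.singletonMap("dc:creator", Arrays.asList("Tim"));
        Map<String, List<String>> second =
                Collections.singletonMap("dc:creator", Arrays.asList("Chris"));
        List<Map<String, List<String>>> outputs = Arrays.asList(first, second);

        System.out.println(merge(outputs, Policy.FIRST_WINS)); // {dc:creator=[Tim]}
        System.out.println(merge(outputs, Policy.LAST_WINS));  // {dc:creator=[Chris]}
        System.out.println(merge(outputs, Policy.KEEP_ALL));   // {dc:creator=[Tim, Chris]}
    }
}
```

Which policy should be the default is exactly the open question; the sketch only shows that all three fit the same multi-valued structure.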


However... that still leaves the Content (XHTML SAX events) case to solve!

Anyone have any ideas on how we can append to or cancel/reset the Content Handler series of SAX events when we move onto a second+ parser for a file?
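
One possible direction, purely as a strawman: buffer each parser's SAX events instead of passing them straight through, then either throw the buffer away (cancel/reset) or replay it into the real handler (append). The sketch below uses only the JDK's org.xml.sax classes; BufferedSaxSketch, reset and flushTo are hypothetical names, not existing Tika classes, and it only records elements and character data. startDocument/endDocument and the rest of the ContentHandler callbacks would need separate handling so the downstream handler sees them exactly once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical buffering handler: records SAX events so they can be dropped or replayed. */
public class BufferedSaxSketch extends DefaultHandler {

    /** One recorded SAX event, replayable against any ContentHandler. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Copy the attributes: a parser may reuse the Attributes object after this call.
        Attributes copy = new AttributesImpl(atts);
        events.add(target -> target.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(target -> target.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the characters: the parser owns the original buffer.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(target -> target.characters(copy, 0, copy.length));
    }

    /** Cancel/reset: discard everything the current parser has emitted so far. */
    public void reset() {
        events.clear();
    }

    /** Append: replay the buffered events into the real handler, then empty the buffer. */
    public void flushTo(ContentHandler target) throws SAXException {
        for (Event event : events) {
            event.replay(target);
        }
        events.clear();
    }
}
```

The obvious cost is memory, since a whole parser's output is held until we decide to keep it, so this may not suit very large documents.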

Thanks
Nick

On 10/26/17, 9:43 AM, "Nick Burch" wrote:

   On Thu, 26 Oct 2017, Chris Mattmann wrote:
   > My general approach to conflicting metadata is simply to define
   > precedence orders.
   >
   > For example here is one documented from OODT:
   >
   > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
   >
   > We can do similar things with Tika, e.g.,
   >
   > [CoreMetadata.PROPERTIES]
   > [ImageParser.METADATA]
   > [TikaOCR.METADATA]

   What happens if two different parsers both output the same bit of metadata
   though? eg Tim's example of one giving dc:creator of Tim and the second
   giving dc:creator of Chris?


   Secondly, what about the XHTML SAX events stream? I think that's probably
   the harder case...

   Nick
