On Mon, 5 Feb 2018, Chris Mattmann wrote:
> Let's have a go at implementing it! You know my thoughts (make it like OODT ;) )

I'm still keen to hear how we can do the text content like OODT!

I have tried to copy the OODT model for the proposed metadata case though :)

Nick

On 2/5/18, 8:37 AM, "Nick Burch" <[email protected]> wrote:

   Ping - anyone got any thoughts on the proposed metadata parser stuff, and
   any ideas on the content part?

   On Tue, 2 Jan 2018, Nick Burch wrote:
   > On Thu, 26 Oct 2017, Chris Mattmann wrote:
   >> On collision, the precedence order defines what key takes precedence and
   >> _overwrites_ the other. Overwrite is but one option (you could save *all*
   >> the values; it’s a multi-valued key structure, so…)
   >
   > OK, I think that's fine. I've had a go at updating the wiki for the metadata
   > case:
   > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
   > And example Tika Config settings for it:
   > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
   > If people are happy with how that sounds/looks, I can have a stab at
   > implementing it, as I *think* it's quite easy
   >
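To make the collision options above concrete, here is a minimal sketch of a precedence-ordered merge. It uses plain Java maps rather than Tika's own metadata class, and the names `PrecedenceMerge`, `OnCollision`, and `merge` are hypothetical, not part of Tika or OODT. It also assumes the sources are listed from lowest to highest precedence, which is a choice made for this illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not a Tika or OODT class. */
final class PrecedenceMerge {
    /** What to do when two sources supply the same key. */
    enum OnCollision { OVERWRITE, KEEP_ALL }

    /**
     * Merges multi-valued metadata from several sources.
     * Assumption: the list runs from lowest to highest precedence,
     * so a later source wins when the policy is OVERWRITE.
     */
    static Map<String, List<String>> merge(
            List<Map<String, List<String>>> lowestToHighest, OnCollision policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : lowestToHighest) {
            for (Map.Entry<String, List<String>> e : source.entrySet()) {
                if (policy == OnCollision.OVERWRITE || !merged.containsKey(e.getKey())) {
                    // First sighting of the key, or a higher-precedence source replaces it.
                    merged.put(e.getKey(), new ArrayList<>(e.getValue()));
                } else {
                    // KEEP_ALL: append, since the structure is multi-valued.
                    merged.get(e.getKey()).addAll(e.getValue());
                }
            }
        }
        return merged;
    }
}
```

With one source giving dc:creator of Tim and a higher-precedence one giving Chris, OVERWRITE leaves only Chris, while KEEP_ALL keeps both values in source order.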
   >
   > However... that still leaves the Content (XHTML SAX events) case to solve!
   >
   > Anyone have any ideas on how we can append to or cancel/reset the Content
   > Handler series of SAX events when we move onto a second+ parser for a file?
   >
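One possible way to approach the append or cancel/reset question is to buffer each parser's SAX events and only forward them once that parser's output is accepted. The sketch below is not an existing Tika class: `BufferingHandler`, `reset`, and `flushTo` are hypothetical names, and it relies only on the standard `org.xml.sax` API. It records just element and character events, and deliberately leaves document start/end to whoever owns the real handler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records SAX events so they can be replayed or dropped. */
final class BufferingHandler extends DefaultHandler {
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Copy the attributes: a SAX source may reuse the same object for later events.
        Attributes copy = new AttributesImpl(atts);
        events.add(t -> t.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(t -> t.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the slice: the caller's buffer is only valid during this call.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(t -> t.characters(copy, 0, copy.length));
    }

    /** Drops everything recorded so far (the cancel/reset case). */
    void reset() {
        events.clear();
    }

    /** Replays the recorded events to the real handler (the append case), then clears. */
    void flushTo(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
        events.clear();
    }
}
```

Under this approach each parser would write into its own buffer, and the composite would call `reset()` on a failed or superseded parser and `flushTo(...)` on one whose content should be kept. The cost is holding a whole document's events in memory, which may not suit large files.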
   > Thanks
   > Nick
   >
   >> On 10/26/17, 9:43 AM, "Nick Burch" <[email protected]> wrote:
   >>
   >>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
   >>    > My general approach to conflicting metadata is simply to define
   >>    > precedence orders.
   >>    >
   >>    > For example here is one documented from OODT:
   >>    >
   >>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
   >>    >
   >>    > We can do similar things with Tika, e.g.,
   >>    >
   >>    > [CoreMetadata.PROPERTIES]
   >>    > [ImageParser.METADATA]
   >>    > [TikaOCR.METADATA]
   >>
   >>    What happens if two different parsers both output the same bit of metadata
   >>    though? e.g. Tim's example of one giving dc:creator of Tim and the second
   >>    giving dc:creator of Chris?
   >>
   >>
   >>    Secondly, what about the XHTML SAX events stream? I think that's probably
   >>    the harder case...
   >>
   >>    Nick

