Ping - anyone got any thoughts on the proposed metadata parser stuff, and
any ideas on the content part?
On Tue, 2 Jan 2018, Nick Burch wrote:
On Thu, 26 Oct 2017, Chris Mattmann wrote:
On collision, the precedence order defines which key takes precedence and
_overwrites_ the other. Overwrite is but one option (you could save *all*
the values; it’s a multi-valued key structure, so…)
OK, I think that's fine. I've had a go at updating the wiki for the metadata
case:
https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
And example Tika Config settings for it
https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
If people are happy with how that sounds/looks, I can have a stab at
implementing it, as I *think* it's quite easy
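For illustration, the two collision options above (overwrite by precedence
vs. keeping every value) could be sketched roughly as below. This is plain
Java over maps rather than Tika's own Metadata class, and the names
MetadataMergeSketch, CollisionPolicy and merge are made up for the example;
it is not the proposed implementation, just a way to pin down the semantics.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Tika: merges per-parser metadata maps
// that are supplied in precedence order (highest precedence first).
class MetadataMergeSketch {
    enum CollisionPolicy { OVERWRITE, KEEP_ALL }

    static Map<String, List<String>> merge(
            List<Map<String, List<String>>> byPrecedence, CollisionPolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : byPrecedence) {
            for (Map.Entry<String, List<String>> e : source.entrySet()) {
                List<String> existing = merged.get(e.getKey());
                if (existing == null) {
                    // First parser to supply this key: take its values as-is.
                    merged.put(e.getKey(), new ArrayList<>(e.getValue()));
                } else if (policy == CollisionPolicy.KEEP_ALL) {
                    // Multi-valued: append any values we have not seen yet.
                    for (String v : e.getValue()) {
                        if (!existing.contains(v)) {
                            existing.add(v);
                        }
                    }
                }
                // OVERWRITE: the higher-precedence value is already stored,
                // so the lower-precedence one is dropped. The end result is
                // the same as the higher-precedence key overwriting it.
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> first = Map.of("dc:creator", List.of("Tim"));
        Map<String, List<String>> second = Map.of("dc:creator", List.of("Chris"));
        // prints {dc:creator=[Tim]}
        System.out.println(merge(List.of(first, second), CollisionPolicy.OVERWRITE));
        // prints {dc:creator=[Tim, Chris]}
        System.out.println(merge(List.of(first, second), CollisionPolicy.KEEP_ALL));
    }
}
```

With the dc:creator example from earlier in the thread, OVERWRITE keeps only
the value from the higher-precedence parser, while KEEP_ALL ends up with both.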
However... that still leaves the Content (XHTML SAX events) case to solve!
Anyone have any ideas on how we can append to or cancel/reset the Content
Handler series of SAX events when we move onto a second+ parser for a file?
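One possible direction, purely as a strawman: put a buffering handler
between the parsers and the real Content Handler, so the events from one
parser can be thrown away (reset) or kept and added to (append) before
anything reaches the caller. The sketch below uses only the JDK's org.xml.sax
classes; BufferingHandlerSketch, reset, replayTo and demo are hypothetical
names, not existing Tika API, and it only records element and character
events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, NOT part of Tika: records SAX events so they can be
// discarded or replayed to the real handler later.
class BufferingHandlerSketch extends DefaultHandler {
    /** One recorded SAX event that can be replayed later. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Copy the attributes, since a parser may reuse the same object.
        Attributes copy = new AttributesImpl(atts);
        events.add(h -> h.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(h -> h.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the characters, since the caller's buffer may be reused.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(h -> h.characters(copy, 0, copy.length));
    }

    /** Cancel everything recorded so far, e.g. before trying the next parser. */
    void reset() {
        events.clear();
    }

    /** Send the recorded events, in order, to the real handler. */
    void replayTo(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
    }

    /** Simulates two parsers: the first one's output is cancelled. */
    static String demo() {
        BufferingHandlerSketch buffer = new BufferingHandlerSketch();
        char[] bad = "partial output".toCharArray();
        buffer.characters(bad, 0, bad.length);   // "parser 1" emits text...
        buffer.reset();                          // ...and we decide to drop it

        char[] good = "text from parser 2".toCharArray();
        buffer.startElement("", "p", "p", new AttributesImpl());
        buffer.characters(good, 0, good.length);
        buffer.endElement("", "p", "p");

        StringBuilder out = new StringBuilder();
        ContentHandler real = new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }
            @Override public void characters(char[] ch, int s, int len) {
                out.append(ch, s, len);
            }
        };
        try {
            buffer.replayTo(real);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints <p>text from parser 2</p>
    }
}
```

Skipping the reset() call between parsers would give the append behaviour
instead. The obvious downside is that the whole event stream sits in memory
until it is replayed, which may not be acceptable for large documents.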
Thanks
Nick
On 10/26/17, 9:43 AM, "Nick Burch" wrote:
On Thu, 26 Oct 2017, Chris Mattmann wrote:
> My general approach to conflicting metadata is simply to define
> precedence orders.
>
> For example here is one documented from OODT:
>
> https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>
> We can do similar things with Tika, e.g.,
>
> [CoreMetadata.PROPERTIES]
> [ImageParser.METADATA]
> [TikaOCR.METADATA]
What happens if two different parsers both output the same bit of metadata,
though? E.g. Tim's example of one giving dc:creator of Tim and the second
giving dc:creator of Chris?
Secondly, what about the XHTML SAX events stream? I think that's probably
the harder case...
Nick