On Mon, 5 Feb 2018, Chris Mattmann wrote:
> Let's have a go at implementing it! You know my thoughts (make it like
> OODT ;) )
I'm still keen to hear how we can do the text content like OODT!
I have tried to copy the OODT model for the proposed metadata case though
:)
Nick
On 2/5/18, 8:37 AM, "Nick Burch" <[email protected]> wrote:
Ping - anyone got any thoughts on the proposed metadata parser stuff, and
any ideas on the content part?
On Tue, 2 Jan 2018, Nick Burch wrote:
> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> On collision, the precedence order defines which key takes precedence and
>> _overwrites_ the other. Overwrite is but one option (you could save *all*
>> the values, since it's a multi-valued key structure, so…)
>
> OK, I think that's fine. I've had a go at updating the wiki for the
> metadata case:
> https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
> And example Tika Config settings for it:
> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
> If people are happy with how that sounds/looks, I can have a stab at
> implementing it, as I *think* it's quite easy.
>
>
> However... that still leaves the Content (XHTML SAX events) case to solve!
>
> Anyone have any ideas on how we can append to, or cancel/reset, the Content
> Handler's series of SAX events when we move on to a second or later parser
> for a file?
>
> Thanks
> Nick
>
>> On 10/26/17, 9:43 AM, "Nick Burch" <[email protected]> wrote:
>>
>> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> > My general approach to conflicting metadata is simply to define
>> > precedence orders.
>> >
>> > For example here is one documented from OODT:
>> >
>> > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>> >
>> > We can do similar things with Tika, e.g.,
>> >
>> > [CoreMetadata.PROPERTIES]
>> > [ImageParser.METADATA]
>> > [TikaOCR.METADATA]
>>
>> What happens if two different parsers both output the same bit of
>> metadata, though? E.g. Tim's example of one giving dc:creator of Tim and
>> the second giving dc:creator of Chris?
>>
>>
>> Secondly, what about the XHTML SAX events stream? I think that's
>> probably the harder case...
>>
>> Nick
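
The precedence idea discussed in this thread (an ordered list of metadata sources, with either overwrite or keep-all-values on collision) could be sketched roughly as below. This is only an illustration, not Tika or OODT code: the class PrecedenceMerge, the CollisionPolicy enum, and the convention that later sources in the list have higher precedence are all assumptions made for the example. It uses plain maps rather than Tika's Metadata class and needs Java 9+ for List.of and Map.of.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Tika): merges per-parser metadata by precedence. */
final class PrecedenceMerge {

    /** What to do when two sources supply the same key. */
    enum CollisionPolicy { OVERWRITE, KEEP_ALL }

    /**
     * Merges multi-valued metadata maps.
     * Assumption: sources are listed from lowest to highest precedence,
     * so a later source wins under OVERWRITE.
     */
    static Map<String, List<String>> merge(
            List<Map<String, List<String>>> sourcesLowToHigh,
            CollisionPolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : sourcesLowToHigh) {
            for (Map.Entry<String, List<String>> entry : source.entrySet()) {
                List<String> existing = merged.get(entry.getKey());
                if (existing == null || policy == CollisionPolicy.OVERWRITE) {
                    // New key, or the higher-precedence source replaces the old values.
                    merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
                } else {
                    // KEEP_ALL: append, so values stay in precedence order.
                    existing.addAll(entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Tim's example: two parsers disagree on dc:creator.
        Map<String, List<String>> firstParser = Map.of("dc:creator", List.of("Tim"));
        Map<String, List<String>> secondParser = Map.of("dc:creator", List.of("Chris"));
        List<Map<String, List<String>>> sources = List.of(firstParser, secondParser);

        System.out.println(merge(sources, CollisionPolicy.OVERWRITE).get("dc:creator"));
        System.out.println(merge(sources, CollisionPolicy.KEEP_ALL).get("dc:creator"));
    }
}
```

With Tim's example, OVERWRITE leaves only the value from the higher-precedence parser, while KEEP_ALL keeps both values in precedence order. Neither policy addresses the XHTML SAX events question, which remains open in the thread.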