Our solution is just to run the parser twice. Yes, I get that it will add overhead,
but as a start, why not?
In short, just run through the stream twice.
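
A minimal sketch of one reading of "run through the stream twice": buffer the input once, then hand each parser its own fresh copy so no parser sees a half-consumed stream. This assumes the input fits in memory. The names run_each_parser, size_parser and first_line_parser are hypothetical stand-ins for illustration, not Tika APIs.

```python
import io
from typing import BinaryIO, Callable, Dict, List

# Hypothetical parser shape: reads a binary stream, returns metadata.
Parser = Callable[[BinaryIO], Dict[str, str]]


def run_each_parser(stream: BinaryIO, parsers: List[Parser]) -> List[Dict[str, str]]:
    """Read the stream once, then give every parser a fresh pass over the bytes."""
    data = stream.read()  # assumption: small enough to hold in memory
    return [parser(io.BytesIO(data)) for parser in parsers]


def size_parser(stream: BinaryIO) -> Dict[str, str]:
    # Consumes the whole stream, which would starve a later parser
    # if both shared the same underlying stream.
    return {"size": str(len(stream.read()))}


def first_line_parser(stream: BinaryIO) -> Dict[str, str]:
    return {"first_line": stream.readline().decode("utf-8").strip()}


results = run_each_parser(
    io.BytesIO(b"hello\nworld\n"), [size_parser, first_line_parser]
)
```

The cost is the overhead mentioned above: one extra pass over the bytes per additional parser, plus the buffer itself.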

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Associate Chief Technology and Innovation Officer, OCIO
Manager, Advanced IT Research and Open Source Projects Office (1761)
Manager, NSF and Open Source Programs and Applications Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-502
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
 
On 2/5/18, 9:25 AM, "Nick Burch" <[email protected]> wrote:

    On Mon, 5 Feb 2018, Chris Mattmann wrote:
    > Let's have a go at implementing it! You know my thoughts (make it like 
    > OODT ;) )
    
    I'm still keen to hear how we can do the text content like OODT!
    
    I have tried to copy the OODT model for the proposed metadata case though 
    :)
    
    Nick
    
    > On 2/5/18, 8:37 AM, "Nick Burch" <[email protected]> wrote:
    >
    >    Ping - anyone got any thoughts on the proposed metadata parser stuff,
    >    and any ideas on the content part?
    >
    >    On Tue, 2 Jan 2018, Nick Burch wrote:
    >    > On Thu, 26 Oct 2017, Chris Mattmann wrote:
    >    >> On collision, the precedence order defines which key takes precedence
    >    >> and _overwrites_ the other. Overwrite is but one option (you could
    >    >> save *all* the values; it’s a multi-valued key structure, so…)
    >    >
    >    > OK, I think that's fine. I've had a go at updating the wiki for the
    >    > metadata case:
    >    > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
    >    > And example Tika Config settings for it
    >    > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
    >    > If people are happy with how that sounds/looks, I can have a stab at
    >    > implementing it, as I *think* it's quite easy
    >    >
    >    >
    >    > However... that still leaves the Content (XHTML SAX events) case to
    >    > solve!
    >    >
    >    > Anyone have any ideas on how we can append to or cancel/reset the
    >    > Content Handler series of SAX events when we move onto a second+
    >    > parser for a file?
    >    >
    >    > Thanks
    >    > Nick
    >    >
    >    >> On 10/26/17, 9:43 AM, "Nick Burch" <[email protected]> wrote:
    >    >>
    >    >>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
    >    >>    > My general approach to conflicting metadata is simply to define
    >    >>    > precedence orders.
    >    >>    >
    >    >>    > For example here is one documented from OODT:
    >    >>    >
    >    >>    >
    >    >>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
    >    >>    >
    >    >>    > We can do similar things with Tika, e.g.,
    >    >>    >
    >    >>    > [CoreMetadata.PROPERTIES]
    >    >>    > [ImageParser.METADATA]
    >    >>    > [TikaOCR.METADATA]
    >    >>
    >    >>    What happens if two different parsers both output the same bit of
    >    >>    metadata though? eg Tim's example of one giving dc:creator of Tim
    >    >>    and the second giving dc:creator of Chris?
    >    >>
    >    >>
    >    >>    Secondly, what about the XHTML SAX events stream? I think that's
    >    >>    probably the harder case...
    >    >>
    >    >>    Nick
    >
    >
    >
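
For the metadata collision case discussed in the thread above, the two options mentioned (overwrite by precedence, or keep all values in a multi-valued key) can be sketched roughly as follows. This is only an illustration: merge_metadata and its keep_all flag are hypothetical names, and it assumes that sources listed later in the precedence order win on collision, which the thread does not state.

```python
from typing import Dict, List

# Multi-valued metadata: each key maps to a list of values.
Metadata = Dict[str, List[str]]


def merge_metadata(ordered_sources: List[Metadata], keep_all: bool = False) -> Metadata:
    """Merge per-parser metadata in precedence order.

    Assumption: later sources take precedence over earlier ones.
    keep_all=False: a later source overwrites the key on collision.
    keep_all=True:  values from every source are kept, duplicates dropped.
    """
    merged: Metadata = {}
    for source in ordered_sources:
        for key, values in source.items():
            if keep_all and key in merged:
                for value in values:
                    if value not in merged[key]:
                        merged[key].append(value)
            else:
                # First sighting of the key, or overwrite mode.
                merged[key] = list(values)
    return merged


first = {"dc:creator": ["Tim"]}
second = {"dc:creator": ["Chris"], "dc:title": ["Report"]}

overwritten = merge_metadata([first, second])
combined = merge_metadata([first, second], keep_all=True)
```

With the dc:creator example from the thread, overwrite mode keeps only the value from the higher-precedence source, while keep_all mode retains both Tim and Chris under the same key.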
