Thanks Nick.

My general approach to conflicting metadata is simply to define precedence 
orders.

For example, here is one documented for OODT:

https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
 

We can do similar things with Tika, e.g.,

[CoreMetadata.PROPERTIES]
[ImageParser.METADATA]
[TikaOCR.METADATA]
…

Then we start with the entry at the top as the base and overlay each 
following entry onto it, working downwards. Make sense?
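
As a rough sketch of that overlay, here is roughly what I have in mind. 
Plain Java maps stand in for Tika Metadata objects, the MetadataOverlay 
class and its overlay method are made-up names for illustration (not an 
existing Tika API), and I'm assuming that entries further down the list win 
when keys clash:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: merges metadata sources in a
// configured precedence order.
final class MetadataOverlay {
    private MetadataOverlay() {}

    /**
     * Overlays the given sources from first to last.
     * Assumption: a later source overwrites a key set by an earlier one;
     * keys that only one source provides are kept as they are.
     */
    static Map<String, String> overlay(List<Map<String, String>> sourcesInOrder) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> source : sourcesInOrder) {
            merged.putAll(source); // same key: the later value replaces the earlier one
        }
        return merged;
    }
}
```

If we wanted the top of the list to win instead, the only change would be 
to use putIfAbsent per key rather than putAll.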

Cheers,
Chris

P.S. The metadata key/value merging principles could be configurable, but a 
default of overlaying according to some configured precedence order, maybe 
in tika-config.xml, would be a fine start.




On 10/26/17, 9:14 AM, "Nick Burch" <[email protected]> wrote:

    On Thu, 26 Oct 2017, Chris Mattmann wrote:
    > Why don’t we just store N copies of the stream, and parse it twice?
    
    I'm not sure that's the challenge though? Using TikaInputStream we can 
    buffer to a temp file if needed to re-read the input
    
    > Of course that’s the ugly way, but currently the way I’ve hacked this in 
    > all of my projects is simply to call Tika N times OUTSIDE of Tika. Why 
    > don’t we just use that as the weakest baseline and work backwards from 
    > there?
    
    I think our main challenge right now is on the output end. How do you deal 
    with multiple different Metadata results that might clash after running 
    Tika several times? How do you deal with multiple (some potentially empty, 
    some overlapping) XHTML outputs from multiple parses? Can we copy those 
    approaches?
    
    Thanks
    Nick

