Thanks Nick. My general approach to conflicting metadata is simply to define precedence orders.
For example, here is one documented from OODT:

https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence

We can do similar things with Tika, e.g.,

[CoreMetadata.PROPERTIES]
[ImageParser.METADATA]
[TikaOCR.METADATA]
…

We then start with the top entry and overlay the others heading downwards.

Make sense?

Cheers,
Chris

P.S. The metadata key/value merging principles could be configurable, but a default of overlaying according to some configured precedence order, maybe in tika-config.xml, would be a fine start.

On 10/26/17, 9:14 AM, "Nick Burch" <[email protected]> wrote:

    On Thu, 26 Oct 2017, Chris Mattmann wrote:
    > Why don’t we just store N copies of the stream, and parse it twice?

    I'm not sure that's the challenge though? Using TikaInputStream we can
    buffer to a temp file if needed to re-read the input.

    > Of course that’s the ugly way, but currently the way I’ve hacked this in
    > all of my projects is simply to call Tika N times OUTSIDE of Tika. Why
    > don’t we just use that as the weakest baseline and work backwards from
    > there?

    I think our main challenge right now is on the output end. How do you deal
    with multiple different Metadata results that might clash after running
    Tika several times? How do you deal with multiple (some potentially empty,
    some overlapping) XHTML outputs from multiple parses? Can we copy those
    approaches?

    Thanks
    Nick
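
To make the precedence-order overlay at the top of this message concrete, here is a minimal sketch. It uses plain Java maps rather than Tika's own Metadata class, and the class name MetadataOverlay and method overlay are hypothetical, not existing Tika API. It also assumes that "overlay" means a source further down the list replaces the value of any key already set by a source above it; if the top entry should win instead, iterate the list in reverse.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: merges metadata from several
// parses according to a configured precedence order.
public class MetadataOverlay {

    // Sources are given top to bottom, in the configured precedence order.
    // Assumption: a later source overwrites clashing keys from earlier ones.
    static Map<String, String> overlay(List<Map<String, String>> sourcesTopToBottom) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> source : sourcesTopToBottom) {
            // putAll replaces existing values and adds keys not seen so far.
            merged.putAll(source);
        }
        return merged;
    }
}
```

Keys that appear in only one source pass through unchanged, so only real clashes are decided by the ordering.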
