On Thu, 26 Oct 2017, Chris Mattmann wrote:
> Why don’t we just store N copies of the stream, and parse it twice?
I'm not sure that's the challenge though? Using TikaInputStream we can buffer to a temp file if needed to re-read the input.
> Of course that’s the ugly way, but currently the way I’ve hacked this in all of my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just use that as the weakest baseline and work backwards from there?
I think our main challenge right now is on the output end. How do you deal with multiple different Metadata results that might clash after running Tika several times? How do you deal with multiple (some potentially empty, some overlapping) XHTML outputs from multiple parses? Can we copy those approaches?
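To make the metadata clash concrete, here is a minimal sketch in plain Java. It deliberately uses ordinary maps rather than Tika's Metadata class, and the class name MetadataMergeSketch, the helper mergeKeepingAll and the sample keys and values are all made up for illustration. It shows just one possible policy (keep every distinct value per key rather than letting a later parse overwrite an earlier one), not necessarily what we'd want in Tika itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataMergeSketch {

    // Hypothetical helper, not a Tika API: merges the metadata produced by
    // several parses of the same document. When two parses disagree on a
    // key, both values are kept (in parse order) instead of one silently
    // overwriting the other. Identical values are stored only once.
    public static Map<String, List<String>> mergeKeepingAll(List<Map<String, String>> runs) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> run : runs) {
            for (Map.Entry<String, String> entry : run.entrySet()) {
                List<String> values =
                        merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
                if (!values.contains(entry.getValue())) {
                    values.add(entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up results from two parses of the same file.
        Map<String, String> firstParse = new LinkedHashMap<>();
        firstParse.put("Content-Type", "application/pdf");
        firstParse.put("title", "Report");

        Map<String, String> secondParse = new LinkedHashMap<>();
        secondParse.put("Content-Type", "application/pdf");
        secondParse.put("title", "Scanned report");

        // The two parses agree on Content-Type but clash on title.
        System.out.println(mergeKeepingAll(Arrays.asList(firstParse, secondParse)));
    }
}
```

Keeping all values only pushes the decision to the caller, of course, and it does nothing for the overlapping XHTML question.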
Thanks,
Nick
