Re: Not-yet-broken breaking changes for Tika 2?

Nick Burch Thu, 26 Oct 2017 09:14:49 -0700

On Thu, 26 Oct 2017, Chris Mattmann wrote:

> Why don’t we just store N copies of the stream, and parse it twice?

I'm not sure that's the challenge though? Using TikaInputStream we can buffer to a temp file if needed to re-read the input.
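To make the "buffer to a temp file, then re-read" idea concrete, here is a minimal sketch using only the JDK. It is a stand-in for the general technique, not Tika's actual implementation; the class and method names (`SpoolAndReread`, `spool`, `readPass`) are hypothetical, and it assumes Java 9+ for `InputStream.readAllBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolAndReread {
    /** Copies a one-shot stream to a temp file so it can be opened repeatedly. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Hypothetical stand-in for one parser pass: opens a fresh stream and reads it fully. */
    static String readPass(Path spooled) throws IOException {
        try (InputStream in = Files.newInputStream(spooled)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream oneShot = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Path spooled = spool(oneShot);
        try {
            // Each pass gets its own stream over the same bytes.
            System.out.println(readPass(spooled));
            System.out.println(readPass(spooled));
        } finally {
            Files.deleteIfExists(spooled);
        }
    }
}
```

The point is only that the input side is cheap to solve: once the bytes are on disk, any number of parsers can each open their own stream.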

> Of course that’s the ugly way, but currently the way I’ve hacked this in all of my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just use that as the weakest baseline and work backwards from there?

I think our main challenge right now is on the output end. How do you deal with multiple different Metadata results that might clash after running Tika several times? How do you deal with multiple (some potentially empty, some overlapping) XHTML outputs from multiple parses? Can we copy those approaches?
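As a rough illustration of the metadata clash, one naive baseline is to keep every distinct value per key rather than picking a winner. The sketch below uses plain JDK maps as a stand-in for Tika's Metadata objects; the `MergeMetadata` class and its `merge` method are hypothetical, and this is just one possible policy, not something Tika does today.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MergeMetadata {
    /**
     * Unions the values for each key across several parse results.
     * Duplicate values collapse; clashing values are all kept, in first-seen order.
     */
    static Map<String, Set<String>> merge(List<Map<String, List<String>>> results) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> result : results) {
            for (Map.Entry<String, List<String>> entry : result.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new LinkedHashSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }
}
```

This sidesteps the question of which parser to trust, at the cost of pushing the clash onto whoever consumes the merged result. It says nothing about the harder XHTML half of the problem.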


Thanks
Nick
