Hi,

On Sun, May 17, 2009 at 11:20 AM, Robert Burrell Donkin <[email protected]> wrote:
> a couple of issues that suggest that this might be better than jumping
> to tika right away:
>
> 1. in terms of interface reuse ATM tika trunk doesn't offer a minimal api[1]
>
> [1] IMHO breaking out a tika api module with minimal dependencies would
> encourage wider use of tika's basic abstractions

That's right. Tika is still in the process of cleaning up many of its
internal abstractions (type registry, metadata handling, etc.), so it
might be premature to use them as-is outside Tika at this point. By
Tika 1.0 I believe the abstractions will be in a pretty solid state.

Anyway, quite a bit of thought has already gone into the Detector
interface, so it would be useful to at least borrow some of the key
parts. One pretty essential piece is the way the InputStream argument
is handled. With the explicitly specified mark/reset behaviour, a
detector can easily delegate detection to multiple different
sub-detectors. Also, each detector gets to decide how much of the
input stream it needs to read to detect the file type.

> 2. the latest release (tika 0.3) is not modular

Correct, that'll be solved in Tika 0.4. I'm sure we can cut the 0.4
release pretty soon if there's demand from RAT.

BR,

Jukka Zitting
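
PS. To make the mark/reset point concrete, here is a rough sketch of how such a contract lets a composite detector hand the same stream to several sub-detectors. Everything in it is a simplified assumption for illustration: SimpleDetector, magic(), composite() and readUpTo() are hypothetical names, and the interface returns a plain MIME type string rather than mirroring the actual Tika Detector signature.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class DetectorSketch {

    /** Hypothetical, simplified stand-in for a detector; not Tika's actual interface. */
    interface SimpleDetector {
        /**
         * Returns a MIME type, or UNKNOWN if the type is not recognised.
         * The stream must support mark/reset; it is left at its original position.
         */
        String detect(InputStream input) throws IOException;
    }

    static final String UNKNOWN = "application/octet-stream";

    /** Reads up to n bytes, stopping early only at end of stream. */
    static byte[] readUpTo(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
            int r = in.read(buf, off, n - off);
            if (r < 0) {
                break;
            }
            off += r;
        }
        return Arrays.copyOf(buf, off);
    }

    /** Matches a fixed byte prefix; reads only as many bytes as the prefix is long. */
    static SimpleDetector magic(byte[] prefix, String type) {
        return input -> {
            input.mark(prefix.length);      // remember the current position
            try {
                byte[] head = readUpTo(input, prefix.length);
                return Arrays.equals(head, prefix) ? type : UNKNOWN;
            } finally {
                input.reset();              // rewind for the next detector
            }
        };
    }

    /** Tries each sub-detector in order; each one sees the stream from the same position. */
    static SimpleDetector composite(List<SimpleDetector> detectors) {
        return input -> {
            for (SimpleDetector detector : detectors) {
                String type = detector.detect(input);
                if (!UNKNOWN.equals(type)) {
                    return type;
                }
            }
            return UNKNOWN;
        };
    }

    public static void main(String[] args) throws IOException {
        SimpleDetector detector = composite(Arrays.asList(
                magic(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png"),
                magic("%PDF".getBytes(StandardCharsets.US_ASCII), "application/pdf")));

        InputStream in = new ByteArrayInputStream(
                "%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
        System.out.println(detector.detect(in));  // prints "application/pdf"
    }
}
```

Because every sub-detector resets the stream before returning, the composite needs no buffering of its own, and the caller can still parse the document from the first byte after detection.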
