Hi,

On Sun, May 17, 2009 at 11:20 AM, Robert Burrell Donkin
<[email protected]> wrote:
> a couple of issues that suggest that this might be better than jumping
> to tika right away:
>
> 1. in terms of interface reuse ATM tika trunk doesn't offer a minimal api[1]
>
> [1] IMHO breaking out a tika api module with minimal dependencies would
> encourage wider use of tika's basic abstractions

That's right. Tika is still in the process of cleaning up many of its
internal abstractions (type registry, metadata handling, etc.), so it
might be premature to use them as-is outside Tika at this point. By
Tika 1.0 I believe the abstractions will be in a pretty solid state.

Anyway, quite a bit of thought has already gone into the Detector
interface, so it would be useful to at least borrow some of the key
parts. One pretty essential piece is the way the InputStream argument
is handled. With the explicitly specified mark/reset behaviour it's
possible for a detector to easily delegate the detection to multiple
different sub-detectors. Also, each detector gets to decide how much
of the input stream it needs to read to detect the file type.
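
To illustrate the mark/reset point, here is a rough sketch in Java. It
is not Tika's actual API: SimpleDetector, MagicDetector,
CompositeDetector and detectBytes are hypothetical names made up for
this example, and returning a plain String type name is a
simplification. The only contract assumed is that each detector leaves
the stream positioned where it found it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class DetectorSketch {

    /** Hypothetical stand-in for a detector interface; not Tika's real one. */
    interface SimpleDetector {
        /**
         * Returns a type name, or null if the type is not recognized.
         * The stream must support mark/reset and is left at the position
         * it had when this method was called.
         */
        String detect(InputStream in) throws IOException;
    }

    /** Recognizes a type by a fixed magic prefix; reads only as many bytes as it needs. */
    static class MagicDetector implements SimpleDetector {
        private final byte[] magic;
        private final String type;

        MagicDetector(String magic, String type) {
            this.magic = magic.getBytes(StandardCharsets.US_ASCII);
            this.type = type;
        }

        @Override
        public String detect(InputStream in) throws IOException {
            in.mark(magic.length);           // remember the current position
            try {
                byte[] buf = new byte[magic.length];
                int n = 0;
                while (n < buf.length) {     // read() may return fewer bytes than asked
                    int r = in.read(buf, n, buf.length - n);
                    if (r < 0) {
                        break;               // end of stream before the prefix was complete
                    }
                    n += r;
                }
                return (n == magic.length && Arrays.equals(buf, magic)) ? type : null;
            } finally {
                in.reset();                  // rewind so the next detector sees the same bytes
            }
        }
    }

    /** Delegates to sub-detectors in order; works because each one rewinds the stream. */
    static class CompositeDetector implements SimpleDetector {
        private final List<SimpleDetector> detectors;

        CompositeDetector(List<SimpleDetector> detectors) {
            this.detectors = detectors;
        }

        @Override
        public String detect(InputStream in) throws IOException {
            for (SimpleDetector d : detectors) {
                String t = d.detect(in);
                if (t != null) {
                    return t;
                }
            }
            return null;
        }
    }

    /** Convenience helper for the example: detect the type of an in-memory byte array. */
    static String detectBytes(byte[] data) {
        SimpleDetector composite = new CompositeDetector(Arrays.asList(
                new MagicDetector("%PDF-", "application/pdf"),
                new MagicDetector("<?xml", "application/xml")));
        try {
            // ByteArrayInputStream supports mark/reset out of the box.
            return composite.detect(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(detectBytes("<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

In this sketch the XML input is first offered to the PDF detector,
which reads five bytes, fails to match and resets; the XML detector
then sees the stream from the start again. A stream that does not
support mark/reset (markSupported() returns false) would need to be
wrapped, for example in a java.io.BufferedInputStream, before being
passed in.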

> 2. the latest release (tika 0.3) is not modular

Correct, that'll be solved in Tika 0.4. I'm sure we can cut the 0.4
release pretty soon if there's demand from RAT.

BR,

Jukka Zitting
