Jukka Zitting wrote:
> Hi,
> 
> On Sun, May 17, 2009 at 11:20 AM, Robert Burrell Donkin
> <[email protected]> wrote:
>> a couple of issues that suggest that this might be better than jumping
>> to tika right away:
>>
>> 1. in terms of interface reuse ATM tika trunk doesn't offer a minimal api[1]
>>
>> [1] IMHO breaking out a tika api module with minimal dependencies would
>> encourage wider use of tika's basic abstractions
> 
> That's right. Tika is still in the process of cleaning up many of its
> internal abstractions (type registry, metadata handling, etc.), so it
> might be premature to use them as-is outside Tika at this point. By
> Tika 1.0 I believe the abstractions will be in a pretty solid state.
> 
> Anyway, quite a bit of thought has already gone into the Detector
> interface, so it would be useful to at least borrow some of the key
> parts. One pretty essential piece is the way the InputStream argument
> is handled. With the explicitly specified mark/reset behaviour it's
> possible for a detector to easily delegate the detection to multiple
> different sub-detectors. Also, each detector gets to decide how much
> of the input stream it needs to read to detect the file type.

my plan is to pattern the API on the ones in tika and see how it works
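
for illustration, here is a rough sketch of the mark/reset delegation described above. every name in it (TypeDetector, MagicDetector, CompositeDetector, DetectorDemo) is hypothetical and made up for this example; it is not tika's actual API, only an assumption about what a minimal version of the idea might look like. the contract assumed here is that the caller passes a stream that supports mark/reset, and each detector leaves the stream where it found it.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface, not Tika's: returns a MIME type or null if unknown.
// Assumed contract: the stream supports mark/reset, and the detector
// resets it to the position it was given before returning.
interface TypeDetector {
    String detect(InputStream in) throws IOException;
}

// Matches a fixed byte prefix. It decides for itself how many bytes to read.
class MagicDetector implements TypeDetector {
    private final byte[] magic;
    private final String type;

    MagicDetector(byte[] magic, String type) {
        this.magic = magic.clone();
        this.type = type;
    }

    @Override
    public String detect(InputStream in) throws IOException {
        in.mark(magic.length);
        try {
            byte[] head = new byte[magic.length];
            int n = 0;
            // read() may return fewer bytes than requested, so loop.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // end of stream before the full prefix
                }
                n += r;
            }
            return (n == magic.length && Arrays.equals(head, magic)) ? type : null;
        } finally {
            in.reset(); // leave the stream untouched for the next detector
        }
    }
}

// Delegates to sub-detectors in order; works because each one resets the stream.
class CompositeDetector implements TypeDetector {
    private final List<TypeDetector> detectors;

    CompositeDetector(List<TypeDetector> detectors) {
        this.detectors = detectors;
    }

    @Override
    public String detect(InputStream in) throws IOException {
        for (TypeDetector d : detectors) {
            String t = d.detect(in);
            if (t != null) {
                return t;
            }
        }
        return null;
    }
}

class DetectorDemo {
    public static void main(String[] args) throws IOException {
        TypeDetector detector = new CompositeDetector(Arrays.asList(
                new MagicDetector("%PDF".getBytes(StandardCharsets.US_ASCII), "application/pdf"),
                new MagicDetector("<?xml".getBytes(StandardCharsets.US_ASCII), "application/xml")));
        byte[] data = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream guarantees mark/reset support.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(detector.detect(in));
    }
}
```

the composite never calls mark() itself; it relies on every sub-detector restoring the stream, which is what makes chaining cheap. a stream that does not support mark/reset would need wrapping (e.g. in a BufferedInputStream) before being passed in.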

>> 2. the latest release (tika 0.3) is not modular
> 
> Correct, that'll be solved in Tika 0.4. I'm sure we can cut the 0.4
> release pretty soon if there's demand from RAT.

it's possible that RAT has some MIME typing heuristics that need to be
fed back upstream. so, probably best to wait a little while at least.

- robert
