Jukka Zitting wrote:
> Hi,
>
> On Sun, May 17, 2009 at 11:20 AM, Robert Burrell Donkin
> <[email protected]> wrote:
>> a couple of issues that suggest that this might be better than jumping
>> to tika right away:
>>
>> 1. in terms of interface reuse ATM tika trunk doesn't offer a minimal api[1]
>>
>> [1] IMHO breaking out a tika api module with minimal dependencies would
>> encourage wider use of tika's basic abstractions
>
> That's right. Tika is still in the process of cleaning up many of its
> internal abstractions (type registry, metadata handling, etc.), so it
> might be premature to use them as-is outside Tika at this point. By
> Tika 1.0 I believe the abstractions will be in a pretty solid state.
>
> Anyway, quite a bit of thought has already gone into the Detector
> interface, so it would be useful to at least borrow some of the key
> parts. One pretty essential piece is the way the InputStream argument
> is handled. With the explicitly specified mark/reset behaviour it's
> possible for a detector to easily delegate the detection to multiple
> different sub-detectors. Also, each detector gets to decide how much
> of the input stream it needs to read to detect the file type.
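
To make the mark/reset point concrete, here is a rough Java sketch of the delegation pattern described above. It is only an illustration: the names TypeDetector, PrefixDetector, CompositeDetector and DetectorSketch are hypothetical and are not Tika's actual API, and the contract that every detector leaves the stream where it found it is an assumption of this sketch. It needs Java 11 or later for InputStream.readNBytes(int).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface for illustration; not Tika's actual Detector API.
interface TypeDetector {
    // Returns a MIME type, or null if this detector does not recognise the data.
    // Assumed contract: the stream supports mark/reset and is left at the
    // position it had on entry.
    String detect(InputStream in) throws IOException;
}

// Recognises a type by a fixed ASCII prefix; reads only as many bytes as it needs.
final class PrefixDetector implements TypeDetector {
    private final byte[] magic;
    private final String type;

    PrefixDetector(String magic, String type) {
        this.magic = magic.getBytes(StandardCharsets.US_ASCII);
        this.type = type;
    }

    @Override
    public String detect(InputStream in) throws IOException {
        in.mark(magic.length);                         // remember the current position
        try {
            byte[] head = in.readNBytes(magic.length); // may be shorter at end of stream
            return Arrays.equals(head, magic) ? type : null;
        } finally {
            in.reset();                                // rewind for the next detector
        }
    }
}

// Delegates to sub-detectors in order; works because each one rewinds the stream.
final class CompositeDetector implements TypeDetector {
    private final List<TypeDetector> delegates;

    CompositeDetector(List<TypeDetector> delegates) {
        this.delegates = delegates;
    }

    @Override
    public String detect(InputStream in) throws IOException {
        for (TypeDetector delegate : delegates) {
            String type = delegate.detect(in);
            if (type != null) {
                return type;
            }
        }
        return null; // no sub-detector recognised the data
    }
}

final class DetectorSketch {
    public static void main(String[] args) throws IOException {
        TypeDetector detector = new CompositeDetector(List.of(
                new PrefixDetector("<?xml", "application/xml"),
                new PrefixDetector("%PDF-", "application/pdf")));
        // BufferedInputStream guarantees mark/reset support for arbitrary sources.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(
                "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(detector.detect(in));
        System.out.println((char) in.read()); // first byte is still unread
    }
}
```

Because every sub-detector marks before reading and resets afterwards, the composite needs no bookkeeping of its own, and the caller can still read the stream from its start once detection is done.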

my plan is to pattern the API on the ones in tika and see how it works

>> 2. the latest release (tika 0.3) is not modular
>
> Correct, that'll be solved in Tika 0.4. I'm sure we can cut the 0.4
> release pretty soon if there's demand from RAT.

it's possible that RAT has some MIME typing heuristics that need to be fed
back upstream. so, probably best to wait a little while at least.

- robert
