Hi Antoni,
>
> The tika mime type detection code has improved greatly since I last
> looked at it a while ago. The root-XML-based detection and
> ContainerAwareDetector are things we (Aperture) have wanted to do
> ourselves since at least 2007 but never got round to it :)
Thanks!
>
> Unfortunately there are many subtle differences between the mime
> definition files which would break existing Aperture applications.
> Therefore I'd like to implement a temporary solution that would work in
> the interim and allow for gradual migration.
>
> first create a normal MimeTypes
> mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
>
> then delete some definitions with
> mimeTypes.deleteMimeType("application/vnd.ms-outlook")
> // in tika this is an msg file
> // in aperture this is a pst file - clearly wrong, but...
>
> and then read our definitions file
> new MimeTypesReader(mimeTypes).read(inputStreamFromOurFile);
>
> Questions:
> 0. Does this make sense? Am I missing something?
It makes sense if you want to programmatically manage the media types rather
than curate them in XML outside of the Tika application. Another option
would be to provide an easy means of refreshing the Detector interface for a
Parser, or just in general (it's possible to do this now, but it involves
lower-level APIs that should probably be better insulated).
> 1. there is no deleteMimeType method. Is it possible to delete a mime
> type definition from a MimeTypes instance? I just wanted to ask before
> trying to implement it myself.
Yeah, there isn't a deleteMimeType or an editMimeType. We never really
provided CRUD-type operations, just what was needed from a reader
perspective. Maybe it makes sense to implement this now, but it would be
great not to clutter the existing reader-focused APIs with these methods, and
instead to create something like a MimeTypesWriter or MimeTypesEditor
interface and put those methods there.
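
To make the idea concrete, here is a rough sketch of how the read side and
the edit side could be kept apart. Every name in it (MediaTypeLookup,
MediaTypeEditor, InMemoryMediaTypes) is hypothetical and made up for this
mail; none of them exist in Tika, and the toy registry only maps a type name
to a description string rather than to a real definition.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// All types below are hypothetical illustrations, not Tika APIs.

/** Read-only view, standing in for the existing reader-focused API. */
interface MediaTypeLookup {
    Optional<String> describe(String mediaType);
}

/** Separate editing interface so the lookup API stays uncluttered. */
interface MediaTypeEditor {
    void put(String mediaType, String description);

    /** Returns true if a definition was actually removed. */
    boolean remove(String mediaType);
}

/** Toy in-memory registry that plays both roles. */
final class InMemoryMediaTypes implements MediaTypeLookup, MediaTypeEditor {
    private final Map<String, String> definitions = new LinkedHashMap<>();

    @Override
    public Optional<String> describe(String mediaType) {
        return Optional.ofNullable(definitions.get(mediaType));
    }

    @Override
    public void put(String mediaType, String description) {
        definitions.put(mediaType, description);
    }

    @Override
    public boolean remove(String mediaType) {
        return definitions.remove(mediaType) != null;
    }
}

final class EditorSketch {
    public static void main(String[] args) {
        InMemoryMediaTypes types = new InMemoryMediaTypes();
        // Stand-in for a definition loaded from the default file.
        types.put("application/vnd.ms-outlook", "msg file");

        // Application code edits only through the editor interface.
        MediaTypeEditor editor = types;
        editor.remove("application/vnd.ms-outlook");
        editor.put("application/vnd.ms-outlook", "pst file");

        // Detection code keeps seeing only the lookup interface.
        MediaTypeLookup lookup = types;
        System.out.println(
            lookup.describe("application/vnd.ms-outlook").orElse("unknown"));
    }
}
```

Code that only detects types would be handed the lookup view, and only
migration code like yours would ever touch the editor.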
> 2. the MimeTypesReader class is not public. Is there any particular
> reason for that? The code seems to augment, not replace the definitions
> so it seems suitable for our use case, but the reader is not public.
Yeah, same rationale as for #1 on this.
> 3. It seems that there is a rule that all minor types either begin with
> x- or are IANA-approved. Please confirm.
That's the way we've curated so far. But just because that's Tika's approach
doesn't mean that it's the only (or most correct) one.
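
If you want to check your own definitions against that convention before
merging them, something along these lines might do. It is only a sketch: the
SubtypeRule class is hypothetical, and the REGISTERED set is a two-entry
placeholder where a real check would load the full IANA registry.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika.
final class SubtypeRule {
    // Placeholder only; a real check would load the complete IANA list.
    static final Set<String> REGISTERED =
        new HashSet<>(Arrays.asList("application/pdf", "text/plain"));

    /** True if the subtype starts with "x-" or the full type is registered. */
    static boolean followsRule(String mediaType) {
        int slash = mediaType.indexOf('/');
        if (slash <= 0 || slash == mediaType.length() - 1) {
            return false; // not of the form type/subtype
        }
        String normalized = mediaType.toLowerCase(Locale.ROOT);
        String subtype = normalized.substring(slash + 1);
        return subtype.startsWith("x-") || REGISTERED.contains(normalized);
    }
}
```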
> 4. It also seems that your mime definition file is not related to the
> one at freedesktop.org, I mean, there are no policies like "First submit
> to freedesktop, wait until they approve and commit and then update the
> tika definitions". Please confirm.
It's related in that they are formatted similarly. However, the process for
curating media types is insulated from outside entities, which I actually
see as a very good thing. That way Tika can serve to bring together existing
curation efforts, but not let those bog down the ability to move forward
with code, and with writing applications that take advantage of these
features.
HTH,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++