Content-type detection for Tika

Jukka Zitting Wed, 06 Sep 2006 02:36:55 -0700

Hi,

I'm thinking about implementing the (draft) shared MIME database spec
[1] from freedesktop.org in Tika as a modern MIME magic implementation
to help automatically detect and handle the types of resources for
which insufficient typing metadata is available. The specified typing
information also includes an inheritance model that allows automatic
failover to more generic parsers (e.g. from image/svg+xml to text/xml)
when specific parser plugins are not available.
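
To make the failover idea concrete, here is a minimal sketch of how a
detected type could be resolved against the set of types that have a
parser. The class and method names (TypeHierarchy, addSubclass,
resolve) are hypothetical and not an existing Tika API, and the sketch
assumes a single parent per type, which is a simplification of the
inheritance model in the spec.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch, not an existing Tika class.
final class TypeHierarchy {
    // Child type -> parent type (assumption: at most one parent per type).
    private final Map<String, String> parents = new HashMap<>();

    void addSubclass(String type, String parent) {
        parents.put(type, parent);
    }

    // Walk from the detected type towards more generic types and return
    // the first one for which a parser is available.
    Optional<String> resolve(String detectedType, Set<String> parsableTypes) {
        Set<String> seen = new HashSet<>(); // guards against cyclic declarations
        String current = detectedType;
        while (current != null && seen.add(current)) {
            if (parsableTypes.contains(current)) {
                return Optional.of(current);
            }
            current = parents.get(current);
        }
        return Optional.empty();
    }
}
```

With image/svg+xml declared as a subclass of text/xml and only an XML
parser installed, resolve("image/svg+xml", ...) would yield text/xml.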

I know that the Java Activation Framework has some of this
functionality and that there are a few MIME magic libraries available
for Java, but my understanding is that all of these are either not
very accurate or unusable in Apache projects due to GPL licensing. I
would also like to add an extension point where available parser
plugins could register their own, more accurate type detection
components.
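
The extension point could look roughly like the sketch below. The
names TypeDetector and DetectorRegistry are hypothetical, chosen only
for illustration. The assumption is that plugin detectors are asked
first, in registration order, and that a generic detector (for example
one backed by the shared MIME database) is used only when none of them
recognizes the content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical extension point: a detector returns a type or nothing.
interface TypeDetector {
    Optional<String> detect(byte[] prefix, String fileName);
}

// Hypothetical registry that parser plugins would register with.
final class DetectorRegistry {
    private final List<TypeDetector> custom = new ArrayList<>();
    private final TypeDetector fallback;

    DetectorRegistry(TypeDetector fallback) {
        this.fallback = fallback;
    }

    void register(TypeDetector detector) {
        custom.add(detector);
    }

    String detect(byte[] prefix, String fileName) {
        // Plugin detectors take precedence over the generic one.
        for (TypeDetector detector : custom) {
            Optional<String> type = detector.detect(prefix, fileName);
            if (type.isPresent()) {
                return type.get();
            }
        }
        // Nothing matched at all: report the generic binary type.
        return fallback.detect(prefix, fileName)
                .orElse("application/octet-stream");
    }
}
```

A PDF parser plugin, for instance, could register a detector that
checks whether the first bytes are "%PDF-" and returns application/pdf.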

Is such functionality already included or planned in Nutch? Any
thoughts, comments or pointers to better get me started?

One drawback of the freedesktop.org spec is that their standard MIME
type database is GPL-licensed, so I can't include it directly in the
project. However, all the major Linux distributions seem to be
adopting the standard, so the database should be available at least on
those platforms without manual installation.
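
Finding an installed database could be sketched as follows. This
assumes the XDG base directory conventions, i.e. a "mime" directory
under $XDG_DATA_HOME and under each entry of $XDG_DATA_DIRS, with the
usual defaults when the variables are unset. MimeDatabaseLocator is a
hypothetical name, and the environment is passed in as a map only to
keep the sketch easy to test.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an existing Tika class.
final class MimeDatabaseLocator {
    // Returns the candidate "mime" directories, most specific first.
    static List<Path> candidateDirs(Map<String, String> env, String userHome) {
        List<Path> dirs = new ArrayList<>();

        // Per-user data directory, assumed default: ~/.local/share
        String dataHome = env.get("XDG_DATA_HOME");
        if (dataHome == null || dataHome.isEmpty()) {
            dataHome = userHome + "/.local/share";
        }
        dirs.add(Paths.get(dataHome, "mime"));

        // System-wide data directories, assumed default as below.
        String dataDirs = env.get("XDG_DATA_DIRS");
        if (dataDirs == null || dataDirs.isEmpty()) {
            dataDirs = "/usr/local/share:/usr/share";
        }
        for (String dir : dataDirs.split(":")) {
            if (!dir.isEmpty()) {
                dirs.add(Paths.get(dir, "mime"));
            }
        }
        return dirs;
    }
}
```

A caller would pass System.getenv() and the user.home system property,
then use the first candidate directories that actually exist.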

[1] http://freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development
