Hi,

I'm thinking about implementing the (draft) shared MIME database spec [1] from freedesktop.org in Tika as a modern MIME magic implementation, to automatically detect and handle resource types when the available typing metadata is insufficient. The typing information in the spec also includes an inheritance model that allows automatic failover to more generic parsers (e.g. from image/svg+xml to application/xml) when a specific parser plugin is not available.
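
To make the failover idea concrete, here is a rough sketch of how a lookup could walk up the type hierarchy until it finds a type that some parser supports. This is only an illustration under my own assumptions, not a proposed API: the class and method names (TypeFallback, addSubclass, resolve) are made up, and it assumes a single parent per type, whereas the spec's sub-class-of declarations may list several.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch: pick the nearest supported type in a MIME type hierarchy. */
class TypeFallback {

    // Child type -> parent type, e.g. as read from sub-class-of declarations.
    // Assumption: one parent per type, to keep the example short.
    private final Map<String, String> parents = new HashMap<>();

    void addSubclass(String type, String parent) {
        parents.put(type, parent);
    }

    /**
     * Returns the given type if a parser supports it, otherwise the closest
     * ancestor that is supported, or empty if no ancestor is supported.
     */
    Optional<String> resolve(String type, Set<String> supported) {
        Set<String> seen = new HashSet<>(); // guards against cyclic declarations
        String current = type;
        while (current != null && seen.add(current)) {
            if (supported.contains(current)) {
                return Optional.of(current);
            }
            current = parents.get(current);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        TypeFallback hierarchy = new TypeFallback();
        hierarchy.addSubclass("image/svg+xml", "application/xml");

        // Only a generic XML parser is registered, so SVG falls back to it.
        Set<String> supported = Set.of("application/xml");
        System.out.println(hierarchy.resolve("image/svg+xml", supported));
    }
}
```

With only an XML parser registered, the SVG type resolves to application/xml; a type with no supported ancestor resolves to an empty result, so the caller can decide what to do with unknown content.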

I know that the Java Activation Framework has some of this functionality and that there are a few MIME magic libraries available for Java, but my understanding is that all of these are either not very accurate or unusable in Apache projects due to GPL licensing.

I would also like to add an extension point where available parser plugins could register even more accurate custom type detection components.

Is such functionality already included or planned in Nutch? Any thoughts, comments or pointers to help me get started?

One drawback of the freedesktop.org spec is that their standard MIME type database is GPL-licensed, so I can't include it directly in the project. However, all the major Linux distributions seem to be adopting the standard, so the database should be available at least on those platforms without manual installation.

[1] http://freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec

BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development
