Hello Aperture (cc tika-dev, may be interesting for you too)

As you know, Tika has made certain advances in the field of mime type identification, which we (Aperture) have wanted to implement for a long time. This is feature request 3043080, but it also applies to bug 3025427 and to feature requests 2210328 (ZipContainerDetector), 1838840 and 1650532 (root-XML-based detection). The oldest is almost 4 years old.

That's why I decided to explore the idea of an implementation of the Aperture MimeTypeIdentifier interface which would delegate the actual identification to the Tika ContainerAwareDetector, backed by the Tika MimeTypes class. I worked on it in aperture-addons and have now moved it to aperture-core, to be included in the next release.

This turned out to be (much) more complex than I thought. There were certain files which Tika recognized better, and certain files which Aperture recognized better. I submitted 7 issues to the Tika JIRA and prepared a little hack that allowed me to augment tika-mimetypes.xml with the knowledge from our mimetypes.xml file.

As of now, the only things that MagicMimeTypeIdentifier does better than TikaMimeTypeIdentifier are:

- support for string patterns in UTF-16 documents. E.g. Tika can't recognize XML or HTML in a full UTF-16 file
- support for allowsWhiteSpace before a pattern. E.g. Tika had problems recognizing the <html> tag if there was some whitespace in front of it (now it works around that limitation in a good enough way though, so it's actually not a problem)
- support for multiple parent types. For example:
  - Quattro Pro 6 used a WordPerfect magic, while later versions used office magics,
  - older Corel Presentations used a WordPerfect magic, newer ones use office,
  - Works spreadsheets 3.0 used a WordPerfect magic, 4.0 used their own format, 7.0 uses office.

  The problem with Tika is that it treats all those cases correctly when only the name is provided, but when both name and bytes are provided, the byte-based mime type trumps the name-based mime type, because the name-based type is not a specialization of the byte-based one. (One type can only have a single parent, so if we say that office is the parent of works, we will recognize only works 7.0, not 3.0 and 4.0.)
- getExtensionsFor(String mimeType), useful in many apps. In Tika the mime knowledge base is hidden in private fields and package-private classes

Yet apart from these minor inconveniences, all of which will probably disappear in the near future, Tika brings benefits:

- more mime type descriptions,
- "correct" names, either IANA-approved, or "proper" vendor-made ones starting with "vnd.", or "invented" ones starting with "x-",
- detection based on the root XML element (at last we can correctly detect XHTML docs with a <?xml version="1.0" encoding="utf-8"?> header),
- better detection of OOXML and OLE docs without a name (thanks to ZipContainerDetector and PoiContainerDetector), though only slightly: the ContainerAwareDetector works best with a full file, but we give it only the first 8KB,
- better plaintext detection, and a couple of other improvements.

I made TikaMimeTypeIdentifier the default choice in ApertureRuntime and in Aperture's Example Application. Existing apps which use MagicMimeTypeIdentifier will not see any difference, though their authors are advised to take a look at the new implementation. The new MimeTypeIdentifier uses different names for many mime types. In most cases these different names are "better", yet they are different and might require a modification of the client code.
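
The single-parent limitation described above can be illustrated with a small, self-contained Java sketch. It is only an illustration under assumptions: the class MimeResolutionSketch and its methods are hypothetical and are not Tika or Aperture code, and the type names are used purely as examples. It models the rule as stated in this mail: the name-based type wins only when it is a specialization of the byte-based type.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Tika's or Aperture's actual implementation.
final class MimeResolutionSketch {
    // Single-parent hierarchy: each child type maps to exactly one parent.
    private final Map<String, String> parentOf = new HashMap<>();

    void setParent(String child, String parent) {
        parentOf.put(child, parent);
    }

    // True if 'type' equals 'ancestor' or reaches it by following parents.
    boolean isSpecializationOf(String type, String ancestor) {
        String current = type;
        // The step bound guards against accidental cycles in the map.
        for (int i = 0; current != null && i <= parentOf.size(); i++) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    // Combines a name-based guess with a byte-based guess.
    String resolve(String byName, String byBytes) {
        if (byBytes == null) {
            return byName;
        }
        if (byName == null) {
            return byBytes;
        }
        // The name-based type is kept only if it specializes the byte-based one.
        return isSpecializationOf(byName, byBytes) ? byName : byBytes;
    }

    public static void main(String[] args) {
        MimeResolutionSketch s = new MimeResolutionSketch();
        // Assumption for illustration: works is declared a child of office only.
        s.setParent("application/vnd.ms-works", "application/x-tika-msoffice");

        // A Works 7.0 file: office magic, .wks name.
        System.out.println(s.resolve("application/vnd.ms-works",
                "application/x-tika-msoffice"));
        // A Works 3.0 file: WordPerfect magic, .wks name.
        System.out.println(s.resolve("application/vnd.ms-works",
                "application/vnd.wordperfect"));
    }
}
```

In this sketch the first call yields the works type and the second yields the WordPerfect type, which is the misidentification described above. Allowing a type to have several parents would let both calls resolve to works.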

Fixing the four limitations outlined above will require additional patches to Tika. I wanted to "release" the code now, to allow for testing before the next Aperture release. In the long term, I think that maintaining two separate mime type identifiers is a bad idea.

So, play with the ApertureRuntime or the CLI apps, try to substitute "new MagicMimeTypeIdentifier()" with "new TikaMimeTypeIdentifier()", and see what happens.

Links:

The file with mime type info which was present in Aperture's mimetypes.xml, but not in tika-mimetypes.xml:
https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/core/src/main/resources/org/semanticdesktop/aperture/tika/diff-mimetypes.xml

A diff between the following two files shows the differences in mime type identification.

Aperture identification (name, identification by data, identification by name and data):
https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/core/src/test/java/org/semanticdesktop/aperture/mime/identifier/magic/ApertureDocumentsIdentificationTest.java

Tika-based identification (only 8KB of each file is taken into account, tika-mimetypes.xml is enhanced via MimeTypesEnhancer with the content of diff-mimetypes.xml):
https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/core/src/test/java/org/semanticdesktop/aperture/tika/TikaMimeTypeIdentifierTest.java

--
Antoni Myłka
[email protected]
