Johan van der Knijff created TIKA-2461:
------------------------------------------

             Summary: Wordperfect file identified as
                 Key: TIKA-2461
                 URL: https://issues.apache.org/jira/browse/TIKA-2461
             Project: Tika
          Issue Type: Bug
          Components: detector
    Affects Versions: 1.16
         Environment: Linux Mint 17
            Reporter: Johan van der Knijff
            Priority: Minor

While running Tika 1.16 in detect mode over some legacy files from our repository system, I came across one file with a .wpd extension for which Tika reported the following mimetype:

{code}
application/x-quattro-pro; version=7-8
{code}

Opening the file in LibreOffice reveals this is actually a WordPerfect document (not sure about which version; the .WPD extension suggests WP 6 or later).

I had a look at the Quattro Pro entry in tika-mimetypes.xml:

{code}
<mime-type type="application/x-quattro-pro">
  <_comment>
    Quattro Pro - Corel Spreadsheet (part of WordPerfect Office suite)
  </_comment>
  <!-- qp2 and wb3 are currently detected by POIFSContainerDetector
       TODO: add detection for wb2 and wb1 -->
  <glob pattern="*.qpw"/>
  <glob pattern="*.wb1"/>
  <glob pattern="*.wb2"/>
  <glob pattern="*.wb3"/>
</mime-type>
{code}

This suggests that the problem originates from POIFSContainerDetector.

For legal reasons I cannot share the original file. However I was able to create a derived file by truncating the original file after 18 kB, and this derived file shows the same behaviour. The file is available at this link:

[tika-identified-as-quattro-pro-truncated.wpd|https://github.com/bitsgalore/shared/raw/master/tika-identified-as-quattro-pro-truncated.wpd]

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
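
For anyone triaging this, a quick look at the leading bytes of the sample may help narrow down which detector is involved. The following is a minimal sketch, not Tika code: the class and method names (SignatureCheck, classify) are hypothetical, and it only checks two well-known signatures, the OLE2 compound file header (D0 CF 11 E0 A1 B1 1A E1) and the WordPerfect header (FF 57 50 43, i.e. 0xFF followed by "WPC"). It makes no claim about what the sample file actually contains or about how POIFSContainerDetector reaches its result.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for manual triage; not part of Tika.
class SignatureCheck {
    // OLE2 compound file signature (the container format POI-based detectors inspect).
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // WordPerfect file prefix: 0xFF followed by the ASCII letters "WPC".
    static final byte[] WPC = {(byte) 0xFF, 'W', 'P', 'C'};

    // True if data begins with all bytes of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Classify a header buffer by its leading signature only.
    static String classify(byte[] header) {
        if (startsWith(header, OLE2)) {
            return "OLE2 container";
        }
        if (startsWith(header, WPC)) {
            return "WordPerfect header";
        }
        return "unknown";
    }

    // Read up to the first 8 bytes of a file and classify them.
    static String classify(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return classify(in.readNBytes(8));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java SignatureCheck.java <file>...");
            return;
        }
        for (String name : args) {
            System.out.println(name + ": " + classify(Path.of(name)));
        }
    }
}
```

On Java 11 or later this can be run directly against the truncated sample with "java SignatureCheck.java tika-identified-as-quattro-pro-truncated.wpd". A result of "OLE2 container" would be consistent with the file being routed through container-based detection; any other result would point elsewhere.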