[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707206#comment-13707206 ]
Lewis John McGibbney commented on NUTCH-1605:
---------------------------------------------

I actually noticed something like this when I was using Tika 1.2 with Any23 recently. I'll see if I can reproduce it, and also see whether and where I can draw parallels with this one, Seb.

> mime type detector recognizes xlsx as zip file
> ----------------------------------------------
>
> Key: NUTCH-1605
> URL: https://issues.apache.org/jira/browse/NUTCH-1605
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.7
> Reporter: Sebastian Nagel
> Attachments: test.xlsx
>
>
> With {{mime.type.magic}} as true (the default) Office Open XML spreadsheets
> (*.xlsx) are treated as zip files and not parsed correctly:
> {code}
> % bin/nutch parsechecker http://localhost/test.xlsx
> fetching: http://localhost/test.xlsx
> parsing: http://localhost/test.xlsx
> contentType: application/zip
> ...
> {code}
> Xlsx files are formally zip files. Nevertheless, both HTTP header and file
> name are clear:
> {code}
> % wget -d http://localhost/test.xlsx
> ...
> HTTP/1.1 200 OK
> ...
> Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> ...
> {code}
> Tika 1.4 detects the type correctly:
> {code}
> % java -jar tika-app-1.4.jar -d http://localhost/test/test.xlsx
> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
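
As an illustration of the behaviour described in the issue, here is a minimal sketch of why magic-byte detection alone reports application/zip for an xlsx file, and how looking inside the container could refine the result. This is not Nutch or Tika code: the function names (detect_by_magic, detect_type, make_zip) and the refinement rule (a [Content_Types].xml entry plus at least one entry under xl/) are assumptions made for the example, and it uses only the Python standard library.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature of a non-empty zip
XLSX_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def detect_by_magic(data: bytes) -> str:
    """Magic-only detection: an xlsx file starts with the zip signature."""
    if data.startswith(ZIP_MAGIC):
        return "application/zip"
    return "application/octet-stream"


def detect_type(data: bytes) -> str:
    """Refine a zip match by inspecting entry names (hypothetical rule)."""
    mime = detect_by_magic(data)
    if mime != "application/zip":
        return mime
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        # Starts like a zip but cannot be read as one: keep the generic type.
        return mime
    # Assumption: a spreadsheet package has [Content_Types].xml and xl/ parts.
    if "[Content_Types].xml" in names and any(n.startswith("xl/") for n in names):
        return XLSX_TYPE
    return mime


def make_zip(entries: dict) -> bytes:
    """Build an in-memory zip archive from a name -> content mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, body in entries.items():
            zf.writestr(name, body)
    return buf.getvalue()


if __name__ == "__main__":
    xlsx_like = make_zip({
        "[Content_Types].xml": "<Types/>",
        "xl/workbook.xml": "<workbook/>",
    })
    print(detect_by_magic(xlsx_like))
    print(detect_type(xlsx_like))
```

The first call shows the magic-only answer for a synthetic xlsx-like archive; the second shows the refined answer. A real detector would also weigh the HTTP Content-Type header and the file name, as the issue points out.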