[jira] [Commented] (NUTCH-1605) mime type detector recognizes xlsx as zip file
[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972053#comment-13972053 ]

Sebastian Nagel commented on NUTCH-1605:

Changes to MIME magic may result in subtle problems: please, test (trunk or 2.x)!

mime type detector recognizes xlsx as zip file
--
Key: NUTCH-1605
URL: https://issues.apache.org/jira/browse/NUTCH-1605
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.7, 2.2.1
Reporter: Sebastian Nagel
Fix For: 2.3, 1.9
Attachments: NUTCH-1605-trunk-v1.patch, NUTCH-1605-trunk-v2.patch, test.xlsx

With {{mime.type.magic}} as true (the default) Office Open XML spreadsheets (*.xlsx) are treated as zip files and not parsed correctly:
{code}
% bin/nutch parsechecker http://localhost/test.xlsx
fetching: http://localhost/test.xlsx
parsing: http://localhost/test.xlsx
contentType: application/zip
...
{code}
Xlsx files are formally zip files. Nevertheless, both HTTP header and file name are clear:
{code}
% wget -d http://localhost/test.xlsx
...
HTTP/1.1 200 OK
...
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
...
{code}
Tika 1.4 detects the type correctly:
{code}
% java -jar tika-app-1.4.jar -d http://localhost/test/test.xlsx
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
{code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
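The remark that xlsx files are formally zip files can be illustrated with a small, self-contained Java sketch. It builds two tiny archives in memory and shows that both start with the same zip signature, so a check on the leading bytes alone cannot tell them apart. The class name ZipMagicSketch, its helper methods, and the rule "an entry under xl/ means spreadsheet" are assumptions made for this illustration only; this is not the detection logic of Nutch or Tika, and the archives built here are not valid spreadsheets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class, not part of Nutch or Tika.
class ZipMagicSketch {

    static final String ZIP = "application/zip";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    /** Builds a small in-memory zip archive with one dummy entry per name. */
    static byte[] buildZip(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** True if the data starts with the zip local file header signature "PK\003\004". */
    static boolean hasZipMagic(byte[] data) {
        return data.length >= 4
            && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    /**
     * Assumed refinement for this sketch: if the leading bytes say "zip",
     * look at the entry names and report xlsx when any entry lives under "xl/".
     */
    static String guessType(byte[] data) {
        if (!hasZipMagic(data)) {
            return "application/octet-stream";
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().startsWith("xl/")) {
                    return XLSX;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ZIP;
    }

    public static void main(String[] args) {
        byte[] xlsxLike = buildZip("[Content_Types].xml", "xl/workbook.xml");
        byte[] plainZip = buildZip("readme.txt");

        // Both archives share the same leading bytes.
        System.out.println(hasZipMagic(xlsxLike) + " " + hasZipMagic(plainZip));
        // Only a look inside the container separates the two.
        System.out.println(guessType(xlsxLike));
        System.out.println(guessType(plainZip));
    }
}
```

A detector that stops at the signature check reports application/zip for both inputs, which matches the parsechecker output quoted in the issue; a look inside the container (or at the HTTP header or the file name) is needed to be more specific.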
[jira] [Commented] (NUTCH-1605) mime type detector recognizes xlsx as zip file
[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943491#comment-13943491 ]

Sebastian Nagel commented on NUTCH-1605:

Patch also applies to 2.x

mime type detector recognizes xlsx as zip file
--
Key: NUTCH-1605
URL: https://issues.apache.org/jira/browse/NUTCH-1605
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.7
Reporter: Sebastian Nagel
Attachments: NUTCH-1605-trunk-v1.patch, NUTCH-1605-trunk-v2.patch, test.xlsx
[jira] [Commented] (NUTCH-1605) mime type detector recognizes xlsx as zip file
[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707206#comment-13707206 ]

Lewis John McGibbney commented on NUTCH-1605:

I actually noticed something like this when I was using Tika 1.2 with Any23 recently. I'll see if I can reproduce it, and also see if and where I can draw parallels with this one, Seb.

mime type detector recognizes xlsx as zip file
--
Key: NUTCH-1605
URL: https://issues.apache.org/jira/browse/NUTCH-1605
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.7
Reporter: Sebastian Nagel
Attachments: test.xlsx