[jira] [Commented] (NUTCH-1605) mime type detector recognizes xlsx as zip file

2014-04-16 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972053#comment-13972053 ]

Sebastian Nagel commented on NUTCH-1605:


Changes to MIME magic can cause subtle problems, so please test the patch 
(on trunk or 2.x)!

 mime type detector recognizes xlsx as zip file
 --

                 Key: NUTCH-1605
                 URL: https://issues.apache.org/jira/browse/NUTCH-1605
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.7, 2.2.1
            Reporter: Sebastian Nagel
             Fix For: 2.3, 1.9
         Attachments: NUTCH-1605-trunk-v1.patch, NUTCH-1605-trunk-v2.patch, test.xlsx


 With {{mime.type.magic}} set to true (the default), Office Open XML spreadsheets 
 (*.xlsx) are detected as zip files and not parsed correctly:
 {code}
 % bin/nutch parsechecker http://localhost/test.xlsx
 fetching: http://localhost/test.xlsx
 parsing: http://localhost/test.xlsx
 contentType: application/zip
 ...
 {code}
 Xlsx files are formally zip archives. Nevertheless, both the HTTP Content-Type 
 header and the file name identify the document type unambiguously:
 {code}
 % wget -d http://localhost/test.xlsx
 ...
 HTTP/1.1 200 OK
 ...
 Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
 ...
 {code}
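
The point about the header and the file name can be illustrated with a minimal Python sketch. It is not the logic used by Nutch or Tika; the function `resolve_type` and the tables `ZIP_BASED_TYPES` and `EXTENSION_HINTS` are hypothetical names introduced only for illustration. The idea is that when content sniffing yields only the generic container type, a more specific hint that names a known zip-based format may be preferred.

```python
import os

XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Hypothetical table: specific types known to be packaged as zip containers.
ZIP_BASED_TYPES = {XLSX}
# Hypothetical table: file-name extensions mapped to those specific types.
EXTENSION_HINTS = {".xlsx": XLSX}


def resolve_type(magic_type, header_type, filename):
    """Pick a content type from the magic result and the two hints."""
    # Anything other than the generic container type is trusted as-is.
    if magic_type != "application/zip":
        return magic_type
    # The server's Content-Type wins if it names a zip-based format.
    if header_type in ZIP_BASED_TYPES:
        return header_type
    # Otherwise fall back to the extension, then to the magic result.
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_HINTS.get(ext, magic_type)


if __name__ == "__main__":
    print(resolve_type("application/zip", XLSX, "test.xlsx"))
    print(resolve_type("application/zip", None, "archive.zip"))
```

The hints are only consulted when they are compatible with the sniffed container type, so a mislabeled non-zip payload would still be reported by its magic result.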
 Tika 1.4 detects the type correctly:
 {code}
 % java -jar tika-app-1.4.jar -d http://localhost/test/test.xlsx
 application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
 {code}
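
Why magic bytes alone report application/zip, and how a detector could refine that result, can be sketched in Python as follows. This is an illustrative assumption, not Tika's or Nutch's implementation: the helpers `make_zip` and `sniff` are hypothetical, and the check looks only at entry names (`[Content_Types].xml` plus an `xl/` directory), whereas a real detector would need to inspect the package more carefully.

```python
import io
import zipfile

XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def make_zip(names):
    """Build an in-memory zip archive with empty entries (test data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()


def sniff(data):
    """Two-stage detection: magic bytes first, then the container's entries."""
    # Stage 1: every non-empty zip archive starts with a local file header.
    if not data.startswith(b"PK\x03\x04"):
        return "application/octet-stream"
    # Stage 2: look inside the archive for the spreadsheet package layout.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
    if "[Content_Types].xml" in names and any(n.startswith("xl/") for n in names):
        return XLSX
    return "application/zip"


if __name__ == "__main__":
    print(sniff(make_zip(["[Content_Types].xml", "xl/workbook.xml"])))
    print(sniff(make_zip(["readme.txt"])))
```

Stopping after stage 1 gives the same answer for both archives, which matches the behaviour reported above.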



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1605) mime type detector recognizes xlsx as zip file

2014-03-21 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943491#comment-13943491 ]

Sebastian Nagel commented on NUTCH-1605:


The patch also applies to 2.x.

 mime type detector recognizes xlsx as zip file
 --

                 Key: NUTCH-1605
                 URL: https://issues.apache.org/jira/browse/NUTCH-1605
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.7
            Reporter: Sebastian Nagel
         Attachments: NUTCH-1605-trunk-v1.patch, NUTCH-1605-trunk-v2.patch, test.xlsx




[jira] [Commented] (NUTCH-1605) mime type detector recognizes xlsx as zip file

2013-07-12 Thread Lewis John McGibbney (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707206#comment-13707206 ]

Lewis John McGibbney commented on NUTCH-1605:
-

I actually noticed something like this when I was using Tika 1.2 with Any23 
recently. I'll see if I can reproduce it, and also see if and where I can draw 
parallels with this one, Seb.

 mime type detector recognizes xlsx as zip file
 --

                 Key: NUTCH-1605
                 URL: https://issues.apache.org/jira/browse/NUTCH-1605
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.7
            Reporter: Sebastian Nagel
         Attachments: test.xlsx



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira