[ 
https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705068#comment-13705068
 ] 

Sebastian Nagel commented on NUTCH-1605:
----------------------------------------

Tika does a good job of detecting the right mime type, even if you try to trick 
it by not providing any URL or file name ({{cat test.xls | java -jar 
tika-app-1.4.jar -d -}}) or by renaming *.xlsx to *.zip.

1. The detector o.a.t.mime.MimeTypes also makes use of the HTTP content type 
and the URL (file name). The mime type is adjusted if the URL or HTTP header 
provides a plausible hint that results in a subclass type (here, xlsx is a 
subclass of {{application/zip}}).

2. tika-app uses more detectors. Detection of xlsx without any hints is done by 
o.a.t.parser.pkg.ZipContainerDetector. These additional detectors are not 
available to o.a.n.util.MimeUtil simply because they are contained in 
tika-parsers-1.4.jar, not in tika-core-1.4.jar. A trial (replacing the 
dependency in ivy/ivy.xml) fixes the problem but pulls a large number of jar 
files into lib/ because of transitive dependencies. Filtering the dependencies 
is not easy since some of them are required (e.g., commons-compress).
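
To make the hint-based adjustment from item 1 concrete, here is a minimal 
sketch in plain Java. It is not the Tika or Nutch code: the class 
MimeHintRefiner, its methods, and the tiny hard-coded type hierarchy are all 
hypothetical, and the real hierarchy lives in Tika's mime type definitions. 
The idea is only that a hint from the HTTP header or file name replaces the 
magic-detected type when the hint is a subclass of it.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of Nutch or Tika. */
final class MimeHintRefiner {

    // Toy hierarchy (child -> parent), an assumption made for this sketch.
    private static final Map<String, String> PARENT = Map.of(
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "application/zip",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "application/zip");

    /** Returns true if 'type' is a (transitive) subclass of 'ancestor'. */
    static boolean isSubtypeOf(String type, String ancestor) {
        for (String t = PARENT.get(type); t != null; t = PARENT.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Keeps the magic-detected type unless the hint (HTTP Content-Type or
     * a type guessed from the file name) is a more specific subclass of it.
     */
    static String refine(String magicType, String hintType) {
        if (hintType != null && isSubtypeOf(hintType, magicType)) {
            return hintType;
        }
        return magicType;
    }
}
```

With this sketch, {{MimeHintRefiner.refine("application/zip", hint)}} returns 
the xlsx type when the hint is the xlsx content type, and stays with 
{{application/zip}} when the hint is missing or unrelated (e.g. 
{{text/html}}).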

Any ideas? Option 1 should be possible by patching MimeUtil. Option 2 is 
definitely the more reliable solution since it works even if the URL or HTTP 
content type give no hints or are wrong.
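
For illustration, the kind of container inspection that makes item 2 work 
without any hints can be sketched with the JDK alone. This is not 
ZipContainerDetector and not its algorithm; the class OoxmlSniffer and the 
rule it applies (a {{[Content_Types].xml}} entry plus at least one entry 
under {{xl/}}) are simplifying assumptions. It also assumes the caller has 
already identified the stream as a zip file by magic.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical sketch for illustration; not part of Nutch or Tika. */
final class OoxmlSniffer {

    static final String ZIP = "application/zip";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    /**
     * Scans the entry names of a zip stream and reports xlsx if the
     * archive looks like an Office Open XML spreadsheet package.
     * Simplified rule (an assumption): "[Content_Types].xml" is present
     * and at least one entry lives under "xl/".
     */
    static String detect(InputStream in) throws IOException {
        boolean hasContentTypes = false;
        boolean hasSpreadsheetPart = false;
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("[Content_Types].xml")) {
                    hasContentTypes = true;
                } else if (name.startsWith("xl/")) {
                    hasSpreadsheetPart = true;
                }
            }
        }
        return (hasContentTypes && hasSpreadsheetPart) ? XLSX : ZIP;
    }
}
```

Because the decision is based on the content only, it does not depend on 
the URL or the HTTP header, which is why this route stays correct when 
those hints are missing or wrong.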

                
> mime type detector recognizes xlsx as zip file
> ----------------------------------------------
>
>                 Key: NUTCH-1605
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1605
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7
>            Reporter: Sebastian Nagel
>         Attachments: test.xlsx
>
>
> With {{mime.type.magic}} set to true (the default), Office Open XML 
> spreadsheets (*.xlsx) are treated as zip files and not parsed correctly:
> {code}
> % bin/nutch parsechecker http://localhost/test.xlsx
> fetching: http://localhost/test.xlsx
> parsing: http://localhost/test.xlsx
> contentType: application/zip
> ...
> {code}
> Xlsx files are formally zip files. Nevertheless, both HTTP header and file 
> name are clear:
> {code}
> % wget -d http://localhost/test.xlsx
> ...
> HTTP/1.1 200 OK
> ...
> Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> ...
> {code}
> Tika 1.4 detects the type correctly:
> {code}
> % java -jar tika-app-1.4.jar -d http://localhost/test/test.xlsx
> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira