[ https://issues.apache.org/jira/browse/NUTCH-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705068#comment-13705068 ]
Sebastian Nagel commented on NUTCH-1605:
----------------------------------------
Tika does a good job of detecting the right mime type, even if you try to trick
it by not providing any URL/file ({{cat test.xls | java \-jar
tika-app-1.4.jar -d -}}) or by renaming *.xlsx to *.zip.
1. The detector o.a.t.mime.MimeTypes also makes use of the HTTP content type and
the URL (file name). The mime type is adjusted if the URL or the HTTP header
provides a plausible hint that results in a subclass type (here xlsx is a
subclass of {{application/zip}}).
2. tika-app uses more detectors. Detection of xlsx without any hints is done by
o.a.t.parser.pkg.ZipContainerDetector. But these additional detectors are not
available to o.a.n.util.MimeUtil, simply because they are contained not in
tika-core-1.4.jar but in tika-parsers-1.4.jar. A trial (replacing the
dependency in ivy/ivy.xml) fixes the problem but pulls a large number of jar
files into lib/ because of transitive dependencies. Filtering the dependencies
is not easy since some of them are required (e.g., commons-compress).
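To illustrate why point 2 needs no hints at all, here is a minimal sketch of the
idea behind container-based detection: look inside the zip and decide from its
entries. This is not the code of ZipContainerDetector. The class name
ZipContainerSketch, its methods, and the entry-name heuristic
({{[Content_Types].xml}} plus {{xl/workbook.xml}}) are assumptions made for this
example, and it uses only java.util.zip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch, not Tika's ZipContainerDetector.
final class ZipContainerSketch {
    static final String ZIP = "application/zip";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    // Refines a stream that magic detection already identified as zip.
    // Assumed heuristic: both marker entries present means xlsx.
    static String refineZip(byte[] data) {
        boolean hasContentTypes = false;
        boolean hasWorkbook = false;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String name = e.getName();
                if (name.equals("[Content_Types].xml")) hasContentTypes = true;
                if (name.equals("xl/workbook.xml")) hasWorkbook = true;
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return (hasContentTypes && hasWorkbook) ? XLSX : ZIP;
    }

    // Builds a small in-memory zip with the given entry names (demo helper).
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buf)) {
            for (String n : names) {
                out.putNextEntry(new ZipEntry(n));
                out.write("<x/>".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(refineZip(zipOf("[Content_Types].xml", "xl/workbook.xml")));
        System.out.println(refineZip(zipOf("readme.txt")));
    }
}
```

Since the decision depends only on the content, a wrong file name or HTTP header
cannot mislead it. The price is that the container has to be read.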
Any ideas? Option 1 should be possible by patching MimeUtil. Option 2 is
definitely the more reliable solution since it works even if the URL or HTTP
content type gives no hints or is wrong.
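For option 1, the rule a patched MimeUtil would need can be sketched roughly as
follows: accept a hint from the HTTP header or the URL only if it is a subclass
of the type found by magic. This is not the actual MimeUtil or MimeTypes code.
The class MimeHintSketch, its methods, and the hand-written two-entry hierarchy
are assumptions made for this example.

```java
import java.util.Map;

// Hypothetical sketch, not Nutch's MimeUtil or Tika's MimeTypes.
final class MimeHintSketch {
    static final String ZIP = "application/zip";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    // Assumed, hand-written excerpt of a type hierarchy: child -> parent.
    private static final Map<String, String> PARENT = Map.of(
        XLSX, ZIP,
        "application/java-archive", ZIP);

    // True if 'type' equals 'ancestor' or descends from it in PARENT.
    static boolean isSpecialization(String type, String ancestor) {
        for (String t = type; t != null; t = PARENT.get(t)) {
            if (t.equals(ancestor)) return true;
        }
        return false;
    }

    // Prefers a hint (HTTP content type first, then the type derived from the
    // URL) only when it refines the magic result; otherwise keeps the magic result.
    static String resolve(String magicType, String httpType, String urlType) {
        if (httpType != null && isSpecialization(httpType, magicType)) return httpType;
        if (urlType != null && isSpecialization(urlType, magicType)) return urlType;
        return magicType;
    }

    public static void main(String[] args) {
        System.out.println(resolve(ZIP, XLSX, null));
        System.out.println(resolve("application/pdf", XLSX, XLSX));
    }
}
```

With such a rule a contradictory hint (e.g. an xlsx content type on a PDF) is
ignored, but a missing or wrong hint still leaves the result at
{{application/zip}}, which is the weakness of option 1 noted above.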
> mime type detector recognizes xlsx as zip file
> ----------------------------------------------
>
> Key: NUTCH-1605
> URL: https://issues.apache.org/jira/browse/NUTCH-1605
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.7
> Reporter: Sebastian Nagel
> Attachments: test.xlsx
>
>
> With {{mime.type.magic}} as true (the default) Office Open XML spreadsheets
> (*.xlsx) are treated as zip files and not parsed correctly:
> {code}
> % bin/nutch parsechecker http://localhost/test.xlsx
> fetching: http://localhost/test.xlsx
> parsing: http://localhost/test.xlsx
> contentType: application/zip
> ...
> {code}
> Xlsx files are formally zip files. Nevertheless, both HTTP header and file
> name are clear:
> {code}
> % wget -d http://localhost/test.xlsx
> ...
> HTTP/1.1 200 OK
> ...
> Content-Type:
> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> ...
> {code}
> Tika 1.4 detects the type correctly:
> {code}
> % java -jar tika-app-1.4.jar -d http://localhost/test/test.xlsx
> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> {code}
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira