[
https://issues.apache.org/jira/browse/TIKA-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17028898#comment-17028898
]
Mihai Glont commented on TIKA-3034:
-----------------------------------
Hi Nick,
We do specify the file name in the metadata instance we pass to the
{{DefaultDetector}}. The problem was that we were sending a plain
{{InputStream}} to the detector rather than a {{TikaInputStream}}, which,
according to the [docs|https://tika.apache.org/1.4/detection.html], disables
everything except the default MIME magic detection:
{quote}Because these container detectors needs to read the whole file to open
and inspect the container, they must be used with a
[org.apache.tika.io.TikaInputStream|https://tika.apache.org/1.4/api/org/apache/tika/io/TikaInputStream.html].
If called with a regular {{InputStream}}, then all work will be done by the
default Mime Magic detection only.
{quote}
The fix in our case was to use
[TikaInputStream.get|https://tika.apache.org/1.4/api/org/apache/tika/io/TikaInputStream.html#get(java.io.InputStream)]
to wrap our input in a {{TikaInputStream}}.
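For anyone hitting the same problem, here is a minimal sketch of the change, assuming Tika 1.x on the classpath. The class name {{DetectExample}} and the helper {{detect}} are hypothetical and are not taken from our code base. {{Metadata.RESOURCE_NAME_KEY}} is the 1.x constant for the file name hint. I believe newer releases expose it elsewhere, so treat that line as version-dependent.
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

// Hypothetical example class, for illustration only.
public class DetectExample {

    static MediaType detect(Path file) throws IOException {
        Detector detector = new DefaultDetector();

        // Pass the file name as a hint, as we already did before the fix.
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, file.getFileName().toString());

        // Wrap the plain InputStream instead of handing it to the detector directly.
        try (InputStream raw = Files.newInputStream(file);
             TikaInputStream tis = TikaInputStream.get(raw)) {
            return detector.detect(tis, metadata);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(detect(Paths.get(args[0])));
    }
}
{code}
Closing the {{TikaInputStream}} also closes the wrapped stream, so the try-with-resources block is enough for cleanup.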
Thanks for your help, this issue can now be closed.
> Detector always returns text/plain when scanning Mathematica files
> ------------------------------------------------------------------
>
> Key: TIKA-3034
> URL: https://issues.apache.org/jira/browse/TIKA-3034
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.23
> Reporter: Tung Nguyen
> Priority: Blocker
> Fix For: 1.23
>
>
> We are using Tika to implement our MIME type detection module. The
> library does not seem to detect Mathematica files, although the documentation
> says it does [1]. The Tika detector always returns `text/plain` instead
> of `application/mathematica`, contrary to the documentation and the
> unit tests [2].
> Doing the same check with the Python code below, we obtain the correct
> MIME type for any Mathematica file downloaded from the Wolfram Library
> Archive [3].
> {code:python}
> #!/usr/bin/python3
> import mimetypes, os, sys
> test_file = sys.argv[1]
> print(mimetypes.MimeTypes().guess_type(test_file)[0])
> {code}
> Therefore, we suspect there is a bug in the Tika detector when it tries to
> guess the MIME type of Mathematica files.
> References:
> [1] [https://tika.apache.org/1.23/formats.html]
> [2]
> [https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java#L64]
> [3] [https://library.wolfram.com/infocenter/Courseware/4706/]
>