[
https://issues.apache.org/jira/browse/TIKA-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027621#comment-17027621
]
Nick Burch commented on TIKA-3034:
----------------------------------
Can you try passing the filename along with the contents when you detect?
Detection of text-based file formats is very tricky, as there is often very
little that's unique to tell them apart. For Mathematica, we don't actually
have any magic at all (we do for some programming languages), so we need the
filename to specialise the text/plain type.
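To illustrate why the name matters, here is a minimal Python sketch of name-only detection, the same mechanism the Python snippet quoted in the report relies on. The file names are hypothetical and no file is ever opened; the explicit `add_type` call is an assumption made so that the result does not depend on which mime.types files are installed locally.

```python
import mimetypes

# A fresh registry. add_type() registers the mapping explicitly so the
# example does not depend on the mime.types files present on this machine.
registry = mimetypes.MimeTypes()
registry.add_type("application/mathematica", ".nb")


def guess(name):
    """Return the MIME type guessed from the name alone (no file is read)."""
    return registry.guess_type(name)[0]


# Hypothetical names: neither file needs to exist.
with_extension = guess("example.nb")
without_extension = guess("example")

print(with_extension)
print(without_extension)
```

With the extension the guess is specific; without it there is nothing to go on, which is roughly the position a content-only detector is in for a plain-text format with no magic.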
Without the code that you're using to call Tika, it's a bit tricky to know how
to tell you to pass the filename in too! (It depends on whether you're using the
App, the Server, the Tika facade class, or a Detector directly.)
Otherwise, if you do know of a unique bit of text that could be found near the
top of a Mathematica file, but not in other text files, please let us know and
we can add that in for more effective detection.
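As a rough sketch of what such a content check could look like, the Python below looks for a marker in the first bytes and falls back to the file name. The marker string, the 1024-byte window and the `sniff` helper are all assumptions for illustration, not Tika's implementation, and the marker would need to be verified against real Mathematica files before anyone relied on it.

```python
# ASSUMPTION: illustrative marker only. It must be checked against real
# Mathematica files before being used for detection.
ASSUMED_MARKER = b"(* Content-type: application/vnd.wolfram.mathematica *)"


def sniff(head: bytes, name: str = "") -> str:
    """Classify by the leading bytes first, then fall back to the file name."""
    if ASSUMED_MARKER in head[:1024]:
        return "application/mathematica"
    if name.lower().endswith(".nb"):
        return "application/mathematica"
    return "text/plain"


marked = sniff(ASSUMED_MARKER + b"\n(* rest of the notebook *)\n")
unmarked = sniff(b"just some text\n")
named = sniff(b"just some text\n", "example.nb")
```

The order of the two checks is a design choice in this sketch: a content marker, when present, works even if the name is missing, while the name remains the only signal for files without one.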
> Detector always returns text/plain when scanning Mathematica files
> ------------------------------------------------------------------
>
> Key: TIKA-3034
> URL: https://issues.apache.org/jira/browse/TIKA-3034
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.23
> Reporter: Tung Nguyen
> Priority: Blocker
> Fix For: 1.23
>
>
> We are using Tika to implement our MIME type detection module. The
> library seemingly cannot detect Mathematica files, although the documentation
> says it can [1]. The Tika detector always returns `text/plain` instead
> of `application/mathematica`, which is what the documentation and the
> unit tests [2] describe.
> Doing the same with the Python code below, we obtain the right
> MIME type for any Mathematica file downloaded from the Wolfram Library
> Archive [3].
> {code:python}
> #!/usr/bin/python3
> import mimetypes
> import sys
>
> # Guess the MIME type from the name of the file given on the command line
> test_file = sys.argv[1]
> print(mimetypes.MimeTypes().guess_type(test_file)[0])
> {code}
> Therefore, we suspect there is a bug in the Tika detector when it tries to
> guess the MIME type of Mathematica files.
> References:
> [1] [https://tika.apache.org/1.23/formats.html]
> [2]
> [https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java#L64]
> [3] [https://library.wolfram.com/infocenter/Courseware/4706/]
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)