[ 
https://issues.apache.org/jira/browse/TIKA-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15446521#comment-15446521
 ] 

Michael Stepner commented on TIKA-2064:
---------------------------------------

Thanks Nick, glad you figured out that there's a non-generic MIME type used for 
these files.

I've asked the Stata community about the license of the auto.dta test file, 
which is the most commonly used test file for Stata datasets.  My question is 
posted here: 
http://www.statalist.org/forums/forum/general-stata-discussion/general/1354690-license-for-auto-dta

I could also create a small test DTA file for use in unit tests.  Do you have a 
preferred way for me to transfer it to you?

In case it is useful in configuring proper detection, the reference spec for 
Stata DTA files is open and published here: http://www.stata.com/help.cgi?dta
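
As a rough illustration of what a magic-byte check could look like, here is a 
minimal Python sketch.  It assumes, based on my reading of the spec, that the 
newer DTA formats (117 and later, which includes the r14 auto.dta) begin with 
the literal tag `<stata_dta>`; that leading tag may also be why the file gets 
mistaken for markup.  Older binary formats are not covered, and the function 
names are hypothetical, not part of Tika.

```python
# Assumption: DTA formats 117+ start with this literal tag (per the dta spec).
STATA_XML_MAGIC = b"<stata_dta>"


def looks_like_stata_dta(head: bytes) -> bool:
    """Heuristic: True if the leading bytes match the newer DTA opening tag."""
    return head.startswith(STATA_XML_MAGIC)


def sniff(path: str) -> bool:
    """Read only as many bytes as the magic string needs and test them."""
    with open(path, "rb") as f:
        return looks_like_stata_dta(f.read(len(STATA_XML_MAGIC)))
```

For example, `sniff("auto.dta")` would be expected to return True for the file 
downloaded in the report below, if the assumption about the leading tag holds.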

> Document type detected incorrectly for Stata datasets (.dta extension)
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2064
>                 URL: https://issues.apache.org/jira/browse/TIKA-2064
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.13
>            Reporter: Michael Stepner
>
> The content type of Stata datasets (created using http://www.stata.com 
> software) is incorrectly detected as `text/html` by Tika. I have tested this 
> using the latest release of Tika, v1.13:
> ```
> $ curl -O http://www.stata-press.com/data/r14/auto.dta
> $ java -jar tika-app-1.13.jar --detect auto.dta
> text/html
> ```
> I believe that the type should instead be `application/octet-stream` (or the 
> equivalent).
> I originally reported this bug downstream (at 
> https://github.com/laurilehmijoki/s3_website/issues/232), and was advised to 
> report upstream to Tika. In addition to the one I downloaded using `curl` in 
> my example, a variety of reference Stata datasets are posted here: 
> http://www.stata-press.com/data/r14/dmain.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
