[ https://issues.apache.org/jira/browse/TIKA-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Stepner updated TIKA-2064:
----------------------------------
Attachment: stata_test_data.dta
Hi Nick,
I've created a test dataset and attached it here.
This file was created in Stata 13.1 running on Mac OS X. The file format
differs slightly depending on the Stata version and the operating system that
wrote the file, but I don't think it's worth building more tests unless we see
a bug with a Stata file created using a different version or OS.
The code that created the test dataset is:
```
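* Start from an empty dataset with three observations
clear all
set obs 3
* One integer, one floating-point, and one string variable
gen byte integers = _n
gen double reals = sqrt(_n)
gen fruits = ""
replace fruits = "apple" in 1
replace fruits = "banana" in 2
replace fruits = "cantaloupe" in 3
* Write the dataset to disk in Stata's native format
save stata_test_data.dta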
```
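In case it helps when checking which variant of the format a given file uses, here is a minimal Python sketch that looks only at the leading bytes. It is not part of Tika, and the function names and return labels are made up for illustration. It assumes that files written by Stata 13 and later open with an XML-like `<stata_dta>` tag (which may be what leads to the `text/html` guess), and that older formats begin with a release byte (113, 114 or 115) followed by a byte-order byte.

```python
def sniff_dta(header: bytes) -> str:
    """Classify leading bytes as a Stata .dta header, or 'unknown'."""
    # Assumption: Stata 13+ files (format 117 and later) start with an
    # XML-like "<stata_dta>" tag.
    if header.startswith(b"<stata_dta>"):
        return "stata-dta-117-or-later"
    # Assumption: older formats start with a release byte (113, 114, 115)
    # followed by a byte-order byte of 1 or 2.
    if len(header) >= 2 and header[0] in (113, 114, 115) and header[1] in (1, 2):
        return "stata-dta-legacy"
    return "unknown"


def sniff_dta_file(path: str) -> str:
    """Read the first few bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_dta(f.read(16))
```

Calling `sniff_dta_file("stata_test_data.dta")` on the attached file would show which branch it falls into; the check is deliberately loose and says nothing about whether the rest of the file is valid.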
I'd like to release this code and the accompanying dataset to the public
domain using [CC0](https://creativecommons.org/publicdomain/zero/1.0/). I can
also apply a more restrictive license if that would make the file easier for
you to use.
> Document type detected incorrectly for Stata datasets (.dta extension)
> ----------------------------------------------------------------------
>
> Key: TIKA-2064
> URL: https://issues.apache.org/jira/browse/TIKA-2064
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.13
> Reporter: Michael Stepner
> Attachments: stata_test_data.dta
>
>
> The content type of Stata datasets (created using Stata,
> http://www.stata.com) is incorrectly detected as `text/html` by Tika. I have
> tested this using the latest release of Tika, v1.13:
> ```
> $ curl -O http://www.stata-press.com/data/r14/auto.dta
> $ java -jar tika-app-1.13.jar --detect auto.dta
> text/html
> ```
> I believe that the type should instead be `application/octet-stream` (or the
> equivalent).
> I originally reported this bug downstream (at
> https://github.com/laurilehmijoki/s3_website/issues/232), and was advised to
> report upstream to Tika. In addition to the one I downloaded using `curl` in
> my example, a variety of reference Stata datasets are posted here:
> http://www.stata-press.com/data/r14/dmain.html