Michael Stepner created TIKA-2064:
-------------------------------------
Summary: Document type detected incorrectly for Stata datasets (.dta extension)
Key: TIKA-2064
URL: https://issues.apache.org/jira/browse/TIKA-2064
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 1.13
Reporter: Michael Stepner
Tika incorrectly detects the content type of Stata datasets (created with the
Stata software, http://www.stata.com) as `text/html`. I tested this with the
latest release of Tika, v1.13:
```
Planarian:Desktop michael$ curl -O http://www.stata-press.com/data/r14/auto.dta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6443  100  6443    0     0  48363      0 --:--:-- --:--:-- --:--:-- 48443
Planarian:Desktop michael$ java -jar tika-app-1.13.jar --detect auto.dta
text/html
```
I believe that the type should instead be `application/octet-stream` (or the
equivalent).
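
The report does not say why `text/html` is chosen. One possible explanation, offered here only as an assumption, is that newer .dta releases (117 and later) begin with an XML-like `<stata_dta>` tag, which a detector could mistake for markup. The sketch below checks a file's leading bytes for that marker. The class and method names are hypothetical, and this is not Tika's detection logic:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class StataSniff {
    // Assumed marker: the opening tag of newer .dta releases (117 and later).
    // Older releases use a different binary header and are not covered here.
    private static final byte[] MARKER =
            "<stata_dta>".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the assumed marker. */
    public static boolean startsWithStataMarker(byte[] head) {
        if (head.length < MARKER.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, MARKER.length), MARKER);
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used in the report; pass another path to override.
        Path file = Paths.get(args.length > 0 ? args[0] : "auto.dta");
        byte[] head = new byte[MARKER.length];
        int read;
        try (InputStream in = Files.newInputStream(file)) {
            // Reads up to MARKER.length bytes; fewer if the file is shorter.
            read = in.readNBytes(head, 0, head.length);
        }
        boolean match = startsWithStataMarker(Arrays.copyOf(head, read));
        System.out.println(file + ": starts with <stata_dta> marker = " + match);
    }
}
```

If the marker is present in `auto.dta`, a dedicated check for it ahead of any generic markup detection would be one way to keep such files from being labelled `text/html`.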
I originally reported this bug downstream (at
https://github.com/laurilehmijoki/s3_website/issues/232) and was advised to
report it upstream to Tika. In addition to the file I downloaded with `curl` in
my example, a variety of reference Stata datasets are posted here:
http://www.stata-press.com/data/r14/dmain.html