[ 
https://issues.apache.org/jira/browse/TIKA-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stepner updated TIKA-2064:
----------------------------------
    Description: 
The content type of Stata datasets (created using http://www.stata.com 
software) is incorrectly detected as `text/html` by Tika. I have tested this 
using the latest release of Tika, v1.13:

```
$ curl -O http://www.stata-press.com/data/r14/auto.dta
$ java -jar tika-app-1.13.jar --detect auto.dta
text/html
```

I believe that the type should instead be `application/octet-stream` (or the 
equivalent).
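
One possible explanation, offered as an assumption rather than something confirmed in this report: newer Stata datasets (format 117 and later, to my knowledge) begin with the literal ASCII tag `<stata_dta>`, and a detector that sniffs for markup could take that for HTML. The sketch below only inspects the leading bytes of a file so the guess can be checked locally. The helper names `looks_like_stata_dta` and `peek` are hypothetical and are not part of Tika.

```python
# Assumed leading marker of newer Stata .dta files (format 117+).
# This is an assumption for illustration, not taken from the Tika sources.
STATA_MARKER = b"<stata_dta>"


def looks_like_stata_dta(path):
    """Return True if the file at `path` starts with the assumed marker."""
    with open(path, "rb") as f:
        return f.read(len(STATA_MARKER)) == STATA_MARKER


def peek(path, n=32):
    """Return the first `n` bytes of the file, for manual inspection."""
    with open(path, "rb") as f:
        return f.read(n)


if __name__ == "__main__":
    import sys

    # Usage: python check_dta.py auto.dta
    for name in sys.argv[1:]:
        print(name, looks_like_stata_dta(name), peek(name))
```

If the check returns true for `auto.dta`, that would be consistent with the markup-sniffing guess, though it would not by itself show which Tika rule produces `text/html`.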

I originally reported this bug downstream (at 
https://github.com/laurilehmijoki/s3_website/issues/232), and was advised to 
report upstream to Tika. In addition to the one I downloaded using `curl` in my 
example, a variety of reference Stata datasets are posted here: 
http://www.stata-press.com/data/r14/dmain.html

  was:
The content type of Stata datasets (created using http://www.stata.com 
software) is incorrectly detected as `text/html` by Tika. I have tested this 
using the latest release of Tika, v1.13:

```
Planarian:Desktop michael$ curl -O http://www.stata-press.com/data/r14/auto.dta
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6443  100  6443    0     0  48363      0 --:--:-- --:--:-- --:--:-- 48443
Planarian:Desktop michael$ java -jar tika-app-1.13.jar --detect auto.dta
text/html
```

I believe that the type should instead be `application/octet-stream` (or the 
equivalent).

I originally reported this bug downstream (at 
https://github.com/laurilehmijoki/s3_website/issues/232), and was advised to 
report upstream to Tika. In addition to the one I downloaded using `curl` in my 
example, a variety of reference Stata datasets are posted here: 
http://www.stata-press.com/data/r14/dmain.html


> Document type detected incorrectly for Stata datasets (.dta extension)
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2064
>                 URL: https://issues.apache.org/jira/browse/TIKA-2064
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 1.13
>            Reporter: Michael Stepner
>
> The content type of Stata datasets (created using http://www.stata.com 
> software) is incorrectly detected as `text/html` by Tika. I have tested this 
> using the latest release of Tika, v1.13:
> ```
> $ curl -O http://www.stata-press.com/data/r14/auto.dta
> $ java -jar tika-app-1.13.jar --detect auto.dta
> text/html
> ```
> I believe that the type should instead be `application/octet-stream` (or the 
> equivalent).
> I originally reported this bug downstream (at 
> https://github.com/laurilehmijoki/s3_website/issues/232), and was advised to 
> report upstream to Tika. In addition to the one I downloaded using `curl` in 
> my example, a variety of reference Stata datasets are posted here: 
> http://www.stata-press.com/data/r14/dmain.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
