[ https://issues.apache.org/jira/browse/TIKA-3992 ]


    Josh McCullough deleted comment on TIKA-3992:
    ---------------------------------------

was (Author: joshm):
`las` file-type detection is returning `application/octet-stream` while `laz` 
is working correctly. Using Tika `2.9.1` with `tika-parsers-standard-package`.

> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
>                 Key: TIKA-3992
>                 URL: https://issues.apache.org/jira/browse/TIKA-3992
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: mimes.zip
>
>
> In the latest Common Crawl crawl, there are ~600k files that Tika detects 
> as 'octet-stream'. It would be useful to extract those files (even if 
> truncated) and run 'file' and 'siegfried' against them to identify the file 
> types that are unknown to Tika. We can then prioritize the most common 
> formats identified by file and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for 
> `application/zip`; there are likely zip-based file types that we could 
> detect more precisely.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.
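
The prioritization step described in the issue could be sketched roughly as follows. This is only an illustration, not the workflow actually used for TIKA-3992: the function names `identify_with_file` and `rank_unknown_types` are hypothetical, the sketch assumes the `file` utility is on the PATH, and it leaves out siegfried entirely. The idea is to label each extracted file with an external tool, drop labels that are still generic, and count what remains.

```python
import subprocess
from collections import Counter
from typing import Iterable, List, Tuple


def identify_with_file(path: str) -> str:
    """Return the MIME type that the `file` utility reports for one path.

    Hypothetical helper: assumes `file` is installed and on the PATH.
    """
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def rank_unknown_types(labels: Iterable[str], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count labels, skipping ones that are still generic, most common first."""
    generic = {"application/octet-stream", ""}
    counts = Counter(label for label in labels if label not in generic)
    return counts.most_common(top_n)
```

Calling `rank_unknown_types(identify_with_file(p) for p in paths)` on the extracted files would give a list of candidate types ordered by frequency, which is one possible way to decide what to look at first.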



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
