[ https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778820#comment-17778820 ]
Josh McCullough commented on TIKA-3992:
---------------------------------------
File-type detection for `las` files returns `application/octet-stream`, while
`laz` files are detected correctly. This is with Tika `2.9.1` and
`tika-parsers-standard-package`.
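
For anyone triaging this, a quick way to confirm that a sample really is a LAS file, independent of Tika, is to look at its first four bytes. The sketch below assumes the ASPRS LAS convention that the header begins with the ASCII signature `LASF`. The class and method names are hypothetical, and this is not how Tika performs detection; it is only a sanity check on the input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for triage; not part of Tika.
public class LasSignatureCheck {
    // Assumption: an ASPRS LAS header starts with the 4-byte ASCII signature "LASF".
    private static final byte[] LAS_MAGIC = "LASF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the LAS signature. */
    public static boolean looksLikeLas(byte[] head) {
        if (head == null || head.length < LAS_MAGIC.length) {
            return false;
        }
        // Compare only the first four bytes against the signature.
        return Arrays.equals(Arrays.copyOf(head, LAS_MAGIC.length), LAS_MAGIC);
    }

    public static void main(String[] args) {
        byte[] sample = {'L', 'A', 'S', 'F', 0, 0, 0, 0};
        System.out.println(looksLikeLas(sample));
    }
}
```

If a file passes this check but Tika still reports `application/octet-stream`, the gap is more likely in the mime-type definitions than in the file itself.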
> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
> Key: TIKA-3992
> URL: https://issues.apache.org/jira/browse/TIKA-3992
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: mimes.zip
>
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as
> detected by Tika. It would be useful to extract those (even if truncated)
> and run 'file' and 'siegfried' against those file types that are unknown to
> Tika. We can prioritize the most common file formats as identified by file
> and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for
> `application/zip`; there are likely zip-based file types that we could do a
> better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.
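
The prioritization step described in the issue could be sketched roughly as follows: given one identification label per file (for example, as reported by `file` or `siegfried`), count the labels and rank them by frequency. The class name, method name, and input shape are hypothetical; how the labels are actually collected from those tools is not shown here.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; assumes one identification label per file.
public class FormatTally {
    /** Counts labels and returns them sorted by descending count, then by label. */
    public static List<Map.Entry<String, Long>> rank(List<String> labels) {
        // Group identical labels and count how often each occurs.
        Map<String, Long> counts = labels.stream()
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
        // Most frequent first; ties are broken alphabetically for stable output.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .collect(Collectors.toList());
    }
}
```

The entries at the top of the ranked list would be the first candidates for new mime-type definitions.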