[ https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1813:
------------------------------
Attachment: unidentified_ole_docs_in_common_crawl_slice.csv
Rather than posting the files themselves, I've attached a list of the files
that parsed without an exception (so they are probably not truncated) but were
still identified only as application/x-tika-msoffice.
> Figure out file types for several unknown OLE files in Common Crawl
> -------------------------------------------------------------------
>
> Key: TIKA-1813
> URL: https://issues.apache.org/jira/browse/TIKA-1813
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB,
> 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF,
> 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA,
> unidentified_ole_docs_in_common_crawl_slice.csv
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files
> in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
> at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats. Any help identifying the file
> types and patching our OLE mime detector would be great.
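
As a possible first triage step for the exception quoted above, here is a
minimal sketch in plain Java (no POI or Tika calls) that reads only the 512-byte
OLE2 header and checks whether one sector it references lies past the end of
the file. The class name Ole2HeaderCheck, its method names, and the result
strings are hypothetical. It assumes the standard compound-file header layout
(sector shift at byte 30, first directory sector at byte 48, sector n starting
at (n + 1) * sectorSize). Under that layout, position 86528 is where sector 168
would start with 512-byte sectors, but whether that is the sector POI was
reading in these files is a guess.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or POI.
class Ole2HeaderCheck {
    // Signature found in the first 8 bytes of an OLE2 compound file.
    static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static boolean hasOle2Magic(byte[] header) {
        return header.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    // Sector shift: unsigned 16-bit little-endian value at offset 30.
    static int sectorShift(byte[] header) {
        return ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getShort(30) & 0xFFFF;
    }

    static int sectorSize(byte[] header) {
        return 1 << sectorShift(header);
    }

    // Byte offset of the first directory sector (sector index at offset 48).
    static long firstDirectoryOffset(byte[] header) {
        long sector = Integer.toUnsignedLong(
            ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getInt(48));
        return (sector + 1) * sectorSize(header);
    }

    // Returns "not-ole2", "bad-header", "truncated", or "ok".
    // Only the first directory sector is checked, not the full sector chains.
    static String diagnose(byte[] header, long fileLength) {
        if (header.length < 52 || !hasOle2Magic(header)) {
            return "not-ole2";
        }
        int shift = sectorShift(header);
        if (shift != 9 && shift != 12) {
            return "bad-header";
        }
        long offset = firstDirectoryOffset(header);
        return offset + sectorSize(header) > fileLength ? "truncated" : "ok";
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path p = Paths.get(arg);
            byte[] header;
            try (InputStream in = Files.newInputStream(p)) {
                header = in.readNBytes(512); // requires Java 11 or later
            }
            System.out.println(arg + "\t" + diagnose(header, Files.size(p)));
        }
    }
}
```

Run over the attached files (for example `java Ole2HeaderCheck.java <file>...`
on Java 11+), this would only separate "header points past end of file" from
"header looks plausible". Files in the second group that still fail in POI
would be the more likely candidates for non-MS OLE formats.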
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)