[
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060254#comment-15060254
]
Tim Allison commented on TIKA-1813:
-----------------------------------
Duh... I initially posted the exceptions on the theory that we may be misreading
how many bytes to read in an old version of the format, but yes, truncated files
make sense. I'll post some other tika-msoffice files that didn't cause
exceptions. Thank you for the tip on the header dumper.
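
For anyone wanting to pre-screen the remaining files, here is a rough sketch of
a truncation check. It is not what POI or the header dumper does internally, and
the class and method names are made up for illustration. It assumes the standard
OLE2 header layout (sector shift at offset 30, first directory sector at offset
48, first 109 DIFAT entries from offset 76) and that sector n starts at byte
(n + 1) * sectorSize. Under those assumptions, and with 512-byte sectors,
position 86528 in the exception is the start of sector 168.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Tika or POI.
final class OleTruncationCheck {
    // OLE2 signature bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long.
    static final long MAGIC = 0xE11AB1A1E011CFD0L;

    // Byte offset at which the given sector starts (the header occupies "sector -1").
    static long sectorOffset(int sector, int sectorSize) {
        return ((long) sector + 1) * sectorSize;
    }

    // True if the header references a sector that starts at or beyond the end of the file.
    static boolean looksTruncated(byte[] header, long fileLength) {
        if (header.length < 512) {
            throw new IllegalArgumentException("header shorter than 512 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getLong(0) != MAGIC) {
            throw new IllegalArgumentException("missing OLE2 signature");
        }
        int shift = buf.getShort(30);
        if (shift != 9 && shift != 12) {
            throw new IllegalArgumentException("unexpected sector shift: " + shift);
        }
        int sectorSize = 1 << shift;
        // First directory sector.
        if (startsPastEnd(buf.getInt(48), sectorSize, fileLength)) {
            return true;
        }
        // FAT sector locations stored in the header's 109 DIFAT slots.
        for (int i = 0; i < 109; i++) {
            if (startsPastEnd(buf.getInt(76 + 4 * i), sectorSize, fileLength)) {
                return true;
            }
        }
        return false;
    }

    // Convenience overload that reads the header from disk (Java 11+ for readNBytes).
    static boolean looksTruncated(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksTruncated(in.readNBytes(512), Files.size(file));
        }
    }

    private static boolean startsPastEnd(int sector, int sectorSize, long fileLength) {
        // Negative values are special markers (free sector, end of chain, ...), so skip them.
        return sector >= 0 && sectorOffset(sector, sectorSize) >= fileLength;
    }
}
```

This only inspects sectors referenced directly from the header, so a file that
passes can still be truncated further along a FAT chain. It is meant as a cheap
first filter, not a replacement for a full parse.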
> Figure out file types for several unknown OLE files in Common Crawl
> -------------------------------------------------------------------
>
> Key: TIKA-1813
> URL: https://issues.apache.org/jira/browse/TIKA-1813
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB,
> 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF,
> 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files
> in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
> at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats. Any help identifying the file
> types and patching our OLE mime detector would be great.