[ 
https://issues.apache.org/jira/browse/TIKA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15062115#comment-15062115
 ] 

Tim Allison commented on TIKA-1813:
-----------------------------------

It's a small world after all (see TIKA-1814): the only three files in our 
current corpus whose XMP packet header is not UTF-8 are unidentified 
x-tika-msoffice files that embed XMP as UTF-16LE.

commoncrawl2/NY/NYTDNLZNXV5E6OLD5KAXUTGVKF426P7W
commoncrawl2/AS/ASPKXLOYDSAGEMVDQX44PNBP7Q4XFDJ7
commoncrawl2_likely_broken/46/46ZWPDW653763E7QVXQHY372SS2FGADG.fla
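
For anyone who wants to repeat this kind of check, here is a minimal sketch of how the encoding of an embedded XMP packet header could be guessed from raw bytes. It is not the code used for the corpus run. It assumes the packet header starts with <?xpacket begin= followed by a quoted U+FEFF byte-order mark in the packet's own encoding, and it treats an empty begin attribute as UTF-8. The function name and the list of candidate encodings are made up for illustration.

```python
from typing import List, Tuple

# Encodings to try, in order. This list is an assumption for the sketch.
CANDIDATE_ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def find_xmp_packet_encodings(data: bytes) -> List[Tuple[str, int]]:
    """Return (encoding, byte offset) for each XMP packet header style found.

    The search needle includes the U+FEFF value of the begin attribute, so
    that UTF-16LE and UTF-16BE data cannot be confused with each other when
    the match happens to start one byte off.
    """
    hits = []
    for encoding in CANDIDATE_ENCODINGS:
        for quote in ('"', "'"):
            needle = ("<?xpacket begin=" + quote + "\ufeff").encode(encoding)
            offset = data.find(needle)
            if offset != -1:
                hits.append((encoding, offset))
                break

    # An empty begin attribute carries no byte-order mark; treat it as UTF-8.
    if not any(encoding == "utf-8" for encoding, _ in hits):
        for needle in (b'<?xpacket begin=""', b"<?xpacket begin=''"):
            offset = data.find(needle)
            if offset != -1:
                hits.insert(0, ("utf-8", offset))
                break
    return hits
```

Run over the bytes of each file in a corpus, any file whose result contains an encoding other than utf-8 would be a candidate for the kind of list above.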

> Figure out file types for several unknown OLE files in Common Crawl
> -------------------------------------------------------------------
>
>                 Key: TIKA-1813
>                 URL: https://issues.apache.org/jira/browse/TIKA-1813
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 225HYXAEU2DKSBNQ3SVD3HXCYMSHXVTB, 
> 25JIANLV77U645GUSJ2E67YSM4B2TNSP, 27BYDLE36XWCDZXA3PPV6MF524UQ6KAF, 
> 2FVEYARCLFMHZ3MPBUH4D3RGPY2EJ4RA, 
> unidentified_ole_docs_in_common_crawl_slice.csv
>
>
> We're getting around 300 exceptions from "application/x-tika-msoffice" files 
> in our current slice of Common Crawl documents that look roughly like this:
> {noformat}
> java.lang.IllegalArgumentException: Position 86528 past the end of the file
>     at org.apache.poi.poifs.nio.FileBackedDataSource.read
> {noformat}
> I suspect these are non-MS OLE file formats.  Any help identifying the file 
> types and patching our OLE mime detector would be great.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
