[ 
https://issues.apache.org/jira/browse/TIKA-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224809#comment-14224809
 ] 

Tim Allison edited comment on TIKA-1487 at 11/25/14 4:54 PM:
-------------------------------------------------------------

This file comes from the govdocs1 
[corpus|http://digitalcorpora.org/corpora/govdocs/]


was (Author: [email protected]):
This file comes from the govdocs1 
[corpus|http://digitalcorpora.org/corpora/nps/nps/nps/nps/files/govdocs1/]

> Add mime for pre-OLE2 xls file
> ------------------------------
>
>                 Key: TIKA-1487
>                 URL: https://issues.apache.org/jira/browse/TIKA-1487
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>         Attachments: 004444.xls
>
>
> On the govdocs1 corpus, nearly 91% of xls exceptions have this stacktrace:
> {noformat}
> Caused by: java.io.IOException: Invalid header signature; read 0x0010000000060409, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
>     at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
>     at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
>     at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
>     at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:162)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 13 more
> {noformat}
> Excel is able to open the few files that I tried, and it appears to treat 
> them as Excel version 4 files.
> On the POI user list, [~gagravarr] identified this header as pre-OLE2 and 
> asked that we add a mime type for it to Tika so that these files can be 
> handled appropriately.  A test file will be attached soon.
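
To make the header mismatch concrete, here is a minimal Java sketch of how the first eight bytes could be sniffed before handing a file to POI. It is not Tika's or POI's actual detection code: the class name `XlsHeaderSniffer`, the `Kind` enum, and the `classify` method are hypothetical. It also assumes that the header from the stack trace (bytes `09 04 06 00 00 00 10 00` when the reported long is laid out little-endian) is a BIFF4 BOF record with record id `0x0409` and length 6, which is consistent with Excel reporting version 4 but is not confirmed in this issue.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only; not part of Tika or POI.
final class XlsHeaderSniffer {

    enum Kind { OLE2, PRE_OLE2_BIFF4, UNKNOWN }

    // OLE2 magic as the little-endian long quoted in the stack trace.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Assumption: a BIFF4 BOF record starts with id 0x0409 and length 6,
    // matching the 0x0010000000060409 value read from the failing files.
    static final int BIFF4_BOF_ID = 0x0409;
    static final int BIFF4_BOF_LENGTH = 6;

    static Kind classify(byte[] header) {
        if (header == null || header.length < 8) {
            return Kind.UNKNOWN;
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getLong(0) == OLE2_SIGNATURE) {
            return Kind.OLE2;
        }
        int recordId = buf.getShort(0) & 0xFFFF;      // first two bytes
        int recordLength = buf.getShort(2) & 0xFFFF;  // next two bytes
        if (recordId == BIFF4_BOF_ID && recordLength == BIFF4_BOF_LENGTH) {
            return Kind.PRE_OLE2_BIFF4;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        // Header bytes reconstructed from the signature in the stack trace.
        byte[] fromStackTrace = {0x09, 0x04, 0x06, 0x00, 0x00, 0x00, 0x10, 0x00};
        System.out.println(classify(fromStackTrace));
    }
}
```

A check along these lines would let a detector assign a distinct mime type to such files instead of letting them fail inside `NPOIFSFileSystem`. A real magic entry would probably need to be stricter than two 16-bit fields to avoid false positives.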



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
