Tim Allison created TIKA-1487:
---------------------------------

             Summary: Add mime for pre-OLE2 xls file
                 Key: TIKA-1487
                 URL: https://issues.apache.org/jira/browse/TIKA-1487
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison
            Priority: Trivial


On the govdocs1 corpus, nearly 91% of the exceptions thrown for xls files have this stack trace:
{noformat}
Caused by: java.io.IOException: Invalid header signature; read 0x0010000000060409, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:162)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 13 more
{noformat}

Excel is able to open the few files that I tried, and it appears to treat 
them as Excel version 4 files.

On the POI user list, [~gagravarr] identified this header as pre-OLE2 and asked 
that we add a mime type for it to Tika so that these files can be handled 
appropriately.  A test file will be attached soon.
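
To illustrate the header check involved, here is a minimal sketch, not Tika's or POI's actual detection code. It assumes the two values in the exception message are the first 8 bytes of the file read as a little-endian long, which matches the well-known OLE2 magic bytes D0 CF 11 E0 A1 B1 1A E1. The class name `HeaderSniffer`, its methods, and the labels it returns are hypothetical. Under that assumption the rejected files start with 09 04 06 00 00 00 10 00. Matching all 8 bytes is probably stricter than a real mime rule would need to be.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of Tika or POI.
final class HeaderSniffer {
    // Both constants are copied from the exception message in the stack trace.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;
    static final long OBSERVED_PRE_OLE2 = 0x0010000000060409L;

    // Interprets the first 8 bytes of the file as a little-endian long
    // (assumption: this is how the values in the message were produced).
    static long readSignature(byte[] head) {
        if (head.length < 8) {
            throw new IllegalArgumentException("need at least 8 bytes");
        }
        return ByteBuffer.wrap(head, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    // Returns an illustrative label, not a registered mime type.
    static String classify(byte[] head) {
        long sig = readSignature(head);
        if (sig == OLE2_SIGNATURE) {
            return "ole2";
        }
        if (sig == OBSERVED_PRE_OLE2) {
            return "pre-ole2";
        }
        return "unknown";
    }
}
```

Passing the first 8 bytes of a file to `HeaderSniffer.classify` separates the two cases seen here, so a detector could route the pre-OLE2 files away from the OLE2 parser instead of letting it throw.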



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
