[ https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600065#comment-13600065 ]
Giuseppe Totaro commented on TIKA-1092:
---------------------------------------

Hi Nick,

I agree with your first observation about old Office documents. I don't think anyone has renamed the files. These files were created with an older version of Word (I think Microsoft Word 6.0) and saved with the .doc extension.

Unfortunately I can't supply my set of files because they are classified. I'll send you one or more files if I find documents without confidentiality restrictions that trigger the same exception.

Thanks,
Giuseppe

> Parsing of old Word file causes a TikaException
> -----------------------------------------------
>
>                 Key: TIKA-1092
>                 URL: https://issues.apache.org/jira/browse/TIKA-1092
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Priority: Minor
>              Labels: office, parse, word-exception
>
> I found an issue with the parse method of
> org.apache.tika.parser.microsoft.OfficeParser. This parser throws a
> TikaException when it tries to parse very old Microsoft Word files.
> I think this issue is not a priority because the files that cause the
> exception belong to an obsolete format/structure that even new Microsoft
> Office versions don't support, but it's important to know that something
> can go wrong with these outdated types.
> Here are two links about the old file types (from Microsoft Support):
> http://support.microsoft.com/?kbid=922850
> http://support.microsoft.com/kb/922849/it
> For example, the TikaException message is:
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@789ab21d
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.IOException: Invalid header signature; read 0x0410401F002DA5DB, expected 0xE11AB1A1E011CFD0
>       at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
>       at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
>       at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
>       at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 5 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
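
Note on the "Invalid header signature" message in the trace: the two hex values can be reproduced by reading the first 8 bytes of a file as a little-endian 64-bit integer and comparing the result with the expected constant. The following is only a minimal sketch for triaging similar files before they reach the parser. The class name `HeaderSignatureCheck` and its methods are hypothetical and are not part of Tika or POI, and the little-endian interpretation is an assumption inferred from the values in the exception message.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika or POI.
public class HeaderSignatureCheck {

    // The value reported as "expected" in the IOException quoted in this issue.
    static final long EXPECTED_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Interprets the first 8 bytes as a little-endian long (assumed to match
    // how the values in the exception message were formed).
    static long readSignature(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 bytes");
        }
        return ByteBuffer.wrap(header, 0, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
    }

    // True when the header carries the expected signature.
    static boolean hasExpectedSignature(byte[] header) {
        return readSignature(header) == EXPECTED_SIGNATURE;
    }

    public static void main(String[] args) throws IOException {
        byte[] header = new byte[8];
        // readFully throws EOFException if the file is shorter than 8 bytes.
        try (DataInputStream in =
                 new DataInputStream(Files.newInputStream(Paths.get(args[0])))) {
            in.readFully(header);
        }
        System.out.printf("signature=0x%016X expected=%b%n",
                readSignature(header), hasExpectedSignature(header));
    }
}
```

Run it as `java HeaderSignatureCheck some-file.doc`. A file whose first bytes are `DB A5 2D 00 1F 40 10 04` yields the "read" value from the trace, so a pre-check like this could route such files away from OfficeParser instead of letting the IOException surface.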