[
https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13601195#comment-13601195
]
Giuseppe Totaro commented on TIKA-1092:
---------------------------------------
Hi Nick,
most files were created in 1992 (before the launch of Word 6).
When I try to open these files with my Word version (Office 2007) I receive the
message:
"You are attempting to open a file that was created in an earlier version of
Microsoft Office. This file type is blocked from opening in this version by
your registry policy setting."
To open the file I have to change the Windows registry, either manually or
with the automated fix, following the instructions in
http://support.microsoft.com/kb/922849/en-us#fixit4me
After the correction, I'm able to open the file with Word and the document
text displays correctly. If I try to save the file (over itself), Word asks
me to select a type. So I can view the file with Word, but I can't determine
the original format version of the document or the application used to
create it.
I tried to find out more about these mysterious files, but I didn't get any
relevant results. For example, I used the command-line tool "file" under
Linux and other metadata analyzers (don't worry... Tika remains my favorite
parser :)).
Thanks,
Giuseppe
> Parsing of old Word file causes a TikaException
> -----------------------------------------------
>
> Key: TIKA-1092
> URL: https://issues.apache.org/jira/browse/TIKA-1092
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Giuseppe Totaro
> Priority: Minor
> Labels: office, parse, word-exception
>
> I found an issue with the parse method of
> org.apache.tika.parser.microsoft.OfficeParser. This parser throws a
> TikaException when it tries to parse very old Microsoft Word files.
> I think this issue is not a priority because the files that cause the
> exception belong to an obsolete format/structure that even recent Microsoft
> Office versions don't support, but it's important to know that something
> can go wrong with these outdated types.
> Here are two links about the old types (from the Microsoft support side):
> http://support.microsoft.com/?kbid=922850
> http://support.microsoft.com/kb/922849/it
> For example, the TikaException message is below:
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@789ab21d
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.IOException: Invalid header signature; read 0x0410401F002DA5DB, expected 0xE11AB1A1E011CFD0
> at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
> at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
> at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
> at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
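
The "Invalid header signature" line in the trace above can be reproduced outside Tika with a few lines of plain Java. This is only a rough sketch, not how POI or Tika implement the check: it assumes the two values in the message are the first eight bytes of the file read as a little-endian long, and the class and method names (HeaderSignatureCheck, readSignature, isOle2) are made up for illustration. If that assumption holds, the "read" value corresponds to a file starting with the bytes DB A5 2D 00, i.e. not an OLE2 container at all.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of Tika or POI.
class HeaderSignatureCheck {
    // The "expected" value quoted in the exception message.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Interprets the first 8 bytes as a little-endian long
    // (assumption: this matches how the value in the message was computed).
    static long readSignature(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        return ByteBuffer.wrap(header, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    // True when the header carries the OLE2 signature.
    static boolean isOle2(byte[] header) {
        return header.length >= 8 && readSignature(header) == OLE2_SIGNATURE;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HeaderSignatureCheck <file>");
            return;
        }
        byte[] header = new byte[8];
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            in.readFully(header); // throws EOFException if the file is shorter than 8 bytes
        }
        System.out.printf("signature=0x%016X ole2=%b%n", readSignature(header), isOle2(header));
    }
}
```

Running it against one of the 1992 files should show whether every failing file has the same leading bytes, which would help when deciding how such files could be detected before they reach OfficeParser.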