[ 
https://issues.apache.org/jira/browse/TIKA-1092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602966#comment-13602966
 ] 

Nick Burch commented on TIKA-1092:
----------------------------------

If that does check out, we'll likely want to add something like:

  <mime-type type="application/msword2">
    <!-- Pre-OLE2, not a subtype of application/x-tika-msoffice -->
    <_comment>Microsoft Word 2 Document</_comment>
    <magic priority="50">
      <match value="0x9ba5" type="string" />
      <match value="0xdba5" type="string" />
    </magic>
  </mime-type>
  <mime-type type="application/msword5">
    <!-- Pre-OLE2, not a subtype of application/x-tika-msoffice -->
    <_comment>Microsoft Word 5 Document</_comment>
    <magic priority="50">
      <match value="0xfe37" type="string" />
    </magic>
  </mime-type>

(Those are based on the magic numbers found by looking through the wv source 
code for hints.)
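
As a rough sanity check of those values against the header reported in the 
issue, here is a minimal Java sketch. It is not Tika code: the class and 
method names (OldWordMagic, sniff, signatureToBytes) are made up for 
illustration. It assumes that a "0x..." string match is compared byte by byte 
at the start of the file, and that POI prints the signature as a 
little-endian 64-bit value (the expected value 0xE11AB1A1E011CFD0 lines up 
with the usual OLE2 byte sequence D0 CF 11 E0 ... under that reading).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class for illustration only; not part of Tika or POI.
final class OldWordMagic {

    // Maps the first two bytes of a file to the media type names proposed
    // above, or returns null when neither pre-OLE2 Word magic matches.
    // Assumption: the magic values are compared in file byte order.
    static String sniff(byte[] header) {
        if (header == null || header.length < 2) {
            return null;
        }
        int b0 = header[0] & 0xFF;
        int b1 = header[1] & 0xFF;
        if ((b0 == 0x9B || b0 == 0xDB) && b1 == 0xA5) {
            return "application/msword2";
        }
        if (b0 == 0xFE && b1 == 0x37) {
            return "application/msword5";
        }
        return null;
    }

    // Converts the 64-bit signature from the exception message back into
    // file byte order, assuming it was read as a little-endian long.
    static byte[] signatureToBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    public static void main(String[] args) {
        // Signature that was actually read from the failing file.
        byte[] reported = signatureToBytes(0x0410401F002DA5DBL);
        System.out.println(sniff(reported));

        // Signature POI expected (OLE2), which should not match either magic.
        byte[] ole2 = signatureToBytes(0xE11AB1A1E011CFD0L);
        System.out.println(sniff(ole2));
    }
}
```

Under those assumptions, the file from the report starts with the bytes 
DB A5, so it would fall under the proposed application/msword2 entry rather 
than being handed to the OLE2 parser.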
                
> Parsing of old Word file causes a TikaException
> -----------------------------------------------
>
>                 Key: TIKA-1092
>                 URL: https://issues.apache.org/jira/browse/TIKA-1092
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Priority: Minor
>              Labels: office, parse, word-exception
>
> I found an issue with the parse method of 
> org.apache.tika.parser.microsoft.OfficeParser. This parser throws a 
> TikaException when it tries to parse very old Microsoft Word files.
> I don't think this issue is a priority, because the files that cause the 
> exception use an obsolete format that even recent Microsoft Office versions 
> no longer support, but it's worth knowing that these outdated types can 
> trigger a failure.
> Here are two links about the old file types (from Microsoft Support):
> http://support.microsoft.com/?kbid=922850
> http://support.microsoft.com/kb/922849/it
> For example, the TikaException message is shown below:
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@789ab21d
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:248)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
> Caused by: java.io.IOException: Invalid header signature; read 0x0410401F002DA5DB, expected 0xE11AB1A1E011CFD0
>       at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
>       at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
>       at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:198)
>       at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:184)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:156)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 5 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
