[ https://issues.apache.org/jira/browse/TIKA-2530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392781#comment-16392781 ]

Tim Allison commented on TIKA-2530:
-----------------------------------

If you think it is fixable, yes, please do open an issue.  However, I'm not 
sure that it is an issue.  The file is being identified as an exe, and it 
clearly isn't...  So, are you asking that we change the mime signature somehow 
so that this file is processed as a text file?  Or are you suggesting that this 
exe file is a good file and we shouldn't be throwing an exception?  

> OutlookExtractor "buffer underrun" when parsing .msg with embedded .msg
> -----------------------------------------------------------------------
>
>                 Key: TIKA-2530
>                 URL: https://issues.apache.org/jira/browse/TIKA-2530
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16, 1.17
>         Environment: Reproduced with both Tika 1.16 and Tika 1.17 on Windows, 
> but the problem likely occurs on all platforms.
>            Reporter: Pascal Essiembre
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: test_file.txt
>
>
> When parsing certain .msg files containing certain attachments (e.g. other 
> .msg files), I get this error:
> {noformat}
> ...
> Caused by: org.apache.poi.util.LittleEndian$BufferUnderrunException: buffer underrun
>         at org.apache.poi.util.LittleEndian.readInt(LittleEndian.java:662)
>         at org.apache.poi.hmef.CompressedRTF.decompress(CompressedRTF.java:73)
>         at org.apache.poi.util.LZWDecompresser.decompress(LZWDecompresser.java:81)
>         at org.apache.poi.hmef.attribute.MAPIRtfAttribute.<init>(MAPIRtfAttribute.java:42)
>         at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:270)
> ...
> {noformat}
> I think the issue is that {{MAPIRtfAttribute}} fails when it receives an 
> empty byte array from {{OutlookExtractor}}.  I was able to eliminate the 
> error by changing the code at around line 269 of {{OutlookExtractor}} in 
> Tika 1.16 (or around line 322 in Tika 1.17) to the following:
> {code:java}
>             //--- START FIX ---
>             ByteChunk chunk = (ByteChunk) rtfChunk;
>             if (chunk != null && chunk.getValue() != null 
>                     && chunk.getValue().length > 0 && !doneBody) {
>                 //ByteChunk chunk = (ByteChunk) rtfChunk;
>             //--- END FIX ---
> {code}
> I am not sure whether this is a real fix, or whether more should be done 
> than just suppressing the error to make sure everything is extracted 
> properly from all files.
> I cannot share the sample file I tested with, since it was given to me as 
> sensitive content, and I could not recreate a faulty .msg file.
> Thanks
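
For readers following the proposed guard, here is a minimal, self-contained sketch of the same pattern using only the JDK. It does not use POI or Tika: readHeader is a hypothetical stand-in for a decompressor that begins by reading a 4-byte little-endian value (as the stack trace suggests CompressedRTF does), and hasRtfBody is a hypothetical helper that mirrors the null/empty/doneBody checks from the snippet above. It only illustrates why an empty array fails and how the check avoids the call; whether skipping the chunk loses content is still the open question.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EmptyChunkGuard {

    // Hypothetical stand-in for a decompressor's first step: read a
    // 4-byte little-endian int. Not POI's actual implementation.
    static int readHeader(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // Hypothetical helper mirroring the guard in the proposed fix:
    // only process the RTF body if there are bytes to read and the
    // body has not already been handled.
    static boolean hasRtfBody(byte[] value, boolean doneBody) {
        return value != null && value.length > 0 && !doneBody;
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] data = {0x2A, 0, 0, 0}; // 42 in little-endian order

        // Without the guard, an empty array cannot supply 4 bytes.
        try {
            readHeader(empty);
        } catch (BufferUnderflowException e) {
            System.out.println("unguarded read of empty array failed");
        }

        // With the guard, the empty chunk is skipped and real data is read.
        for (byte[] value : new byte[][] {empty, data}) {
            if (hasRtfBody(value, false)) {
                System.out.println("header = " + readHeader(value));
            } else {
                System.out.println("skipping empty RTF chunk");
            }
        }
    }
}
```

In this sketch the empty chunk is simply skipped; a real fix would presumably also need to fall back to another body representation where one exists.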



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
