[jira] [Commented] (TIKA-1236) CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG files

Tim Allison (JIRA) Tue, 11 Feb 2014 18:52:13 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898712#comment-13898712 ]


Tim Allison commented on TIKA-1236:
-----------------------------------

That's probably for the better. :)  I'm finding a bit more clarity on this.  I 
had a chance to look at POI's MAPIMessage and its history.  I think that when 
the Tika code was written, MAPIMessage's guess7BitEncoding() only looked at the 
headers.  So, Tika's code says "if there are headers, try to get the encoding 
from the headers; otherwise, try CharsetDetector."  guess7BitEncoding() now 
looks in other places as well, including the internet codepage (InternetCPID) 
parameter and the HTML if there is an htmlbody, so it is now probably more 
robust than CharsetDetector.

So, a question for the community: do we still need CharsetDetector to determine 
the encoding of 7-bit strings in MSG files at all?

Based on a very small number of files, my answer is probably not, but folks 
with more experience and files on hand might have a different answer.

Ideally, guess7BitEncoding() would return a boolean indicating whether or not 
it found something in one of those three places (headers, InternetCPID, HTML 
body); if it didn't, we could then run CharsetDetector.  If that is the route 
to go, I'll open an issue in POI and add a return value.
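
A rough sketch of that control flow, to make the proposal concrete.  Everything 
here is an assumption for illustration: the Message interface is a hypothetical 
stand-in for MAPIMessage, the boolean return from guess7BitEncoding() is the 
proposed change rather than the current POI signature, and the Supplier stands 
in for whatever CharsetDetector call Tika would make.

```java
import java.util.function.Supplier;

final class EncodingFallback {
    /** Hypothetical stand-in for the two MAPIMessage calls discussed here. */
    interface Message {
        // Assumed variant: true if an encoding was found in the headers,
        // InternetCPID, or the HTML body. Not the current POI signature.
        boolean guess7BitEncoding();

        void set7BitEncoding(String charsetName);
    }

    /**
     * Runs the detector only when the message could not guess its own encoding.
     * Returns the detector's charset name if it was applied, otherwise null.
     */
    static String applyFallback(Message msg, Supplier<String> detector) {
        if (msg.guess7BitEncoding()) {
            return null; // found in the message itself; no detection needed
        }
        String detected = detector.get(); // stand-in for CharsetDetector
        if (detected != null) {
            msg.set7BitEncoding(detected);
        }
        return detected;
    }
}
```

With this shape, the detector is never consulted for messages that already 
carry encoding information, which is the behavior the boolean is meant to 
enable.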

Thank you.

> CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG 
> files
> -------------------------------------------------------------------------------
>
>                 Key: TIKA-1236
>                 URL: https://issues.apache.org/jira/browse/TIKA-1236
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: TIKA-1236.patch
>
>
> When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries 
> to detect the encoding.  For a handful of files, the CharsetDetector returns 
> "IBM424_rtl" with a confidence > the threshold.  This encoding is then set  
> with MAPIMessage.set7BitEncoding().  When MAPI tries to use this encoding, it 
> finds that it is unsupported and throws an exception. 
> Full stacktrace:
> {noformat}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
> ...irrelevant test framework junk...
> Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
>       at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
>       at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
>       at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
>       at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
>       at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 26 more
> Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
>       at java.lang.StringCoding.decode(Unknown Source)
>       at java.lang.String.<init>(Unknown Source)
>       at java.lang.String.<init>(Unknown Source)
>       at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
>       ... 33 more
> {noformat}
> Unfortunately, I can't share the problematic documents, and I can't create a 
> synthetic document that triggers this issue.
> Two questions:
> 1)  Should CharsetDetector return an encoding that is not supported?
> 2)  If so, should we add a simple check before calling set7BitEncoding()?
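
On question 2 in the quoted report, one possible form of the "simple check" is 
to ask the JVM whether it can decode with the detected name before passing it 
on.  This is only a sketch and not the attached patch: the class and method 
names are made up for illustration, and the only real API used is 
java.nio.charset.Charset.isSupported().

```java
import java.nio.charset.Charset;

final class SupportedCharsetCheck {
    /**
     * Returns true only if the running JVM has a charset registered under
     * this name (or alias). Null, empty, and syntactically illegal names
     * are treated as unsupported instead of throwing.
     */
    static boolean isUsable(String charsetName) {
        if (charsetName == null || charsetName.isEmpty()) {
            return false;
        }
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException is a subclass: the name itself is malformed.
            return false;
        }
    }
}
```

A caller could skip set7BitEncoding() whenever isUsable() returns false for 
the detected name, which would avoid the UnsupportedEncodingException in the 
stack trace above for names such as "IBM424_rtl".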



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
