Tim Allison created TIKA-1236:
---------------------------------

             Summary: EncodingDetector returning unsupported encoding for some 7-bit Outlook/MSG files
                 Key: TIKA-1236
                 URL: https://issues.apache.org/jira/browse/TIKA-1236
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.6
            Reporter: Tim Allison
            Priority: Minor


When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries to
detect the encoding.  For a handful of files, the EncodingDetector returns
"IBM424_rtl" with a confidence above the threshold.  This encoding is then set
with MAPIMessage.set7BitEncoding().  When POI's StringChunk tries to decode the
7-bit data with this encoding, the JVM reports it as unsupported and POI throws
a RuntimeException, which Tika wraps in a TikaException.
Full stacktrace:

{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
...irrelevant test framework junk...
Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
        at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
        at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
        at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
        at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
        at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 26 more
Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
        at java.lang.StringCoding.decode(Unknown Source)
        at java.lang.String.<init>(Unknown Source)
        at java.lang.String.<init>(Unknown Source)
        at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
        ... 33 more
{noformat}

Unfortunately, I can't share the problematic documents, and I can't create a 
synthetic document that triggers this issue.

Two questions:
1)  Should EncodingDetector return an encoding that is not supported?
2)  If so, should we add a simple check before calling set7BitEncoding()?
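
For question 2, a minimal sketch of what such a check might look like, using
only java.nio.charset.Charset.isSupported() from the JDK.  The class and method
names (CharsetGuard, isUsable, orFallback) are hypothetical and are not part of
Tika or POI; this is an illustration of the idea, not a proposed patch.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Tika or POI.
final class CharsetGuard {
    private CharsetGuard() {}

    /** Returns true only if the running JVM can resolve the named charset. */
    static boolean isUsable(String charsetName) {
        if (charsetName == null) {
            // Charset.isSupported(null) would throw IllegalArgumentException.
            return false;
        }
        try {
            return Charset.isSupported(charsetName);
        } catch (IllegalCharsetNameException e) {
            // Syntactically invalid names are treated as unsupported.
            return false;
        }
    }

    /** Returns the detected name if the JVM supports it, otherwise the fallback. */
    static String orFallback(String detected, String fallback) {
        return isUsable(detected) ? detected : fallback;
    }
}
```

The caller in OutlookExtractor could then skip set7BitEncoding() whenever
CharsetGuard.isUsable(detectedName) is false, leaving POI's default in place.
Whether to skip, fall back, or try the detector's next-best match is a design
choice this sketch does not settle.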



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
