[
https://issues.apache.org/jira/browse/TIKA-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1236:
------------------------------
Description:
When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries to
detect the encoding. For a handful of files, the CharsetDetector returns
"IBM424_rtl" with a confidence above the threshold. This encoding is then set
with MAPIMessage.set7BitEncoding(). When POI's StringChunk tries to decode the
data with this encoding, the JVM does not support it and an exception is thrown.
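Since the problematic documents can't be shared, the failure can at least be illustrated in isolation. The sketch below is not POI's code: the class UnsupportedEncodingDemo and its decode() helper are hypothetical, and they only mirror the exception chain visible in the stack trace (an UnsupportedEncodingException from the String constructor, wrapped in a RuntimeException).

```java
import java.io.UnsupportedEncodingException;

// Hypothetical stand-in for the decoding step; not the actual POI implementation.
class UnsupportedEncodingDemo {

    // Decodes raw bytes with a charset given by name, wrapping the checked
    // exception in a RuntimeException as the stack trace suggests.
    static String decode(byte[] data, String encodingName) {
        try {
            return new String(data, encodingName);
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("Encoding not found - " + encodingName, e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {72, 105};
        // A name the JVM knows decodes normally.
        System.out.println(decode(data, "US-ASCII"));
        try {
            // The name reported by the detector is not a JVM charset name.
            decode(data, "IBM424_rtl");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage() + " / cause: " + e.getCause());
        }
    }
}
```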
Full stacktrace:
{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
    ...irrelevant test framework junk...
Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
    at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
    at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
    at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
    at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
    at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 26 more
Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
    at java.lang.StringCoding.decode(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
    ... 33 more
{noformat}
Unfortunately, I can't share the problematic documents, and I can't create a
synthetic document that triggers this issue.
Two questions:
1) Should CharsetDetector return an encoding that is not supported?
2) If so, should we add a simple check before calling set7BitEncoding()?
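As a rough illustration of the "simple check" in question 2, the sketch below uses java.nio.charset.Charset.isSupported() to decide whether a detected name is safe to hand on. The class EncodingGuard and its methods are hypothetical names introduced here, not Tika or POI API; the candidate list stands in for detector results ordered by confidence, and the call to set7BitEncoding() appears only in a comment.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the names are illustrative only.
class EncodingGuard {

    // Returns true only if the JVM can decode with the named charset.
    static boolean isUsable(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // Syntactically illegal names are treated as unsupported.
            return false;
        }
    }

    // Returns the first usable name from candidates (assumed to be ordered
    // by detector confidence), or null if none is supported.
    static String firstUsable(List<String> candidates) {
        for (String name : candidates) {
            if (isUsable(name)) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String chosen = firstUsable(Arrays.asList("IBM424_rtl", "windows-1252", "UTF-8"));
        if (chosen != null) {
            // In the parser, this is where msg.set7BitEncoding(chosen) would go.
            System.out.println("Would set 7-bit encoding to " + chosen);
        } else {
            System.out.println("No supported encoding detected; keeping the default");
        }
    }
}
```

Falling through to the next candidate rather than failing outright is one possible policy; skipping the call entirely and keeping the message's default encoding would be another.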
was:
When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries to
detect the encoding. For a handful of files, the EncodingDetector returns
"IBM424_rtl" with a confidence above the threshold. This encoding is then set
with MAPIMessage.set7BitEncoding(). When POI's StringChunk tries to decode the
data with this encoding, the JVM does not support it and an exception is thrown.
Full stacktrace:
{noformat}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
    ...irrelevant test framework junk...
Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
    at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
    at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
    at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
    at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
    at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 26 more
Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
    at java.lang.StringCoding.decode(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
    ... 33 more
{noformat}
Unfortunately, I can't share the problematic documents, and I can't create a
synthetic document that triggers this issue.
Two questions:
1) Should EncodingDetector return an encoding that is not supported?
2) If so, should we add a simple check before calling set7BitEncoding()?
> CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG
> files
> -------------------------------------------------------------------------------
>
> Key: TIKA-1236
> URL: https://issues.apache.org/jira/browse/TIKA-1236
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.6
> Reporter: Tim Allison
> Priority: Minor