[
https://issues.apache.org/jira/browse/TIKA-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898712#comment-13898712
]
Tim Allison commented on TIKA-1236:
-----------------------------------
That's probably for the better. :) I'm getting a bit more clarity on this: I
had a chance to look at POI's MAPIMessage and its history. I think that when
the Tika code was written, MAPIMessage's guess7BitEncoding() only looked at the
headers. So, Tika's code says "if there are headers, try to get the encoding
from the headers; otherwise try CharsetDetector." guess7BitEncoding() now looks
in other places too, including the internet codepage (InternetCPID) parameter
and the HTML if there is an htmlbody, so it is now probably more robust than
CharsetDetector.
So, a question for the community: do we still need CharsetDetector to determine
the encoding of 7-bit strings in MSG files at all?
My answer, based on a very small number of files, is probably not, but folks
with more experience and more files on hand might have a different answer.
Ideally, guess7BitEncoding() would return a boolean indicating whether it found
something in one of those three places (headers, InternetCPID, HTML body); if
it didn't, we could then run CharsetDetector. If that is the route to go, I'll
open an issue in POI and add a return value.
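
To make the proposed flow concrete, here is a rough sketch. It does not use
the real POI or Tika classes: SevenBitMessage is a hypothetical stand-in for
MAPIMessage, the boolean return value on guess7BitEncoding() is the proposed
change (not something that exists today), and the detector is just a supplier
of an optional charset name.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class EncodingFallback {

    /** Hypothetical stand-in for POI's MAPIMessage; not the real API. */
    public interface SevenBitMessage {
        /** Assumed (proposed) contract: true if an encoding was found and applied. */
        boolean guess7BitEncoding();

        void set7BitEncoding(String charsetName);
    }

    /**
     * Prefer the message's own guess; consult the detector only if the guess
     * found nothing. Returns true if some encoding was applied.
     */
    public static boolean resolveEncoding(SevenBitMessage msg,
                                          Supplier<Optional<String>> detector) {
        if (msg.guess7BitEncoding()) {
            return true; // the message already found and applied an encoding
        }
        // Fallback: only now pay for (and trust) statistical detection.
        Optional<String> detected = detector.get();
        detected.ifPresent(msg::set7BitEncoding);
        return detected.isPresent();
    }
}
```

The point of the boolean is that the detector is never consulted when the
message already knows its encoding, so a bad detector guess cannot override a
good one.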
Thank you.
> CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG
> files
> -------------------------------------------------------------------------------
>
> Key: TIKA-1236
> URL: https://issues.apache.org/jira/browse/TIKA-1236
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.6
> Reporter: Tim Allison
> Priority: Minor
> Attachments: TIKA-1236.patch
>
>
> When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries
> to detect the encoding. For a handful of files, the CharsetDetector returns
> "IBM424_rtl" with a confidence above the threshold. This encoding is then set
> with MAPIMessage.set7BitEncoding(). When MAPI tries to use this encoding, it
> finds that it is unsupported and throws an exception.
> Full stacktrace:
> {noformat}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
>     ...irrelevant test framework junk...
> Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
>     at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
>     at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
>     at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
>     at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
>     at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 26 more
> Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
>     at java.lang.StringCoding.decode(Unknown Source)
>     at java.lang.String.<init>(Unknown Source)
>     at java.lang.String.<init>(Unknown Source)
>     at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
>     ... 33 more
> {noformat}
> Unfortunately, I can't share the problematic documents, and I can't create a
> synthetic document that triggers this issue.
> Two questions:
> 1) Should CharsetDetector return an encoding that is not supported?
> 2) If so, should we add a simple check before calling set7BitEncoding()?
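
On question 2, one possible shape for such a check is sketched below, using
only java.nio.charset.Charset from the JDK. The class and method names are
made up for illustration and are not taken from the attached patch; the
caller would simply skip set7BitEncoding() when the check returns false.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class CharsetCheck {

    /**
     * Returns true only if the running JVM can decode with the named charset.
     * Null and syntactically illegal names are treated as unsupported instead
     * of being allowed to throw.
     */
    public static boolean isSupportedCharset(String name) {
        if (name == null) {
            return false; // Charset.isSupported(null) would throw
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // e.g. a name containing spaces
        }
    }

    public static void main(String[] args) {
        // The name from the stack trace above; expected to be unsupported on
        // a stock JVM, which is what triggers the exception in POI.
        System.out.println(isSupportedCharset("IBM424_rtl"));
        System.out.println(isSupportedCharset("UTF-8"));
    }
}
```

This only guards the call site; it does not answer question 1, whether the
detector should report such names in the first place.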