[ 
https://issues.apache.org/jira/browse/TIKA-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898189#comment-13898189
 ] 

Tim Allison edited comment on TIKA-1236 at 2/11/14 7:42 PM:
------------------------------------------------------------

First draft of a patch.  The basic idea is to pick the first supported 
encoding; if unsupported encodings are found along the way, those are reported 
in the metadata.  [~kkrugler], this goes against your recommendation.  

Encoding detectors will not always be correct, and we risk mojibake whether or 
not we use a detector.  The question is which choice is more likely to be wrong. 
Currently, if the detector's confidence is < 35, we treat the detection as 
unreliable and do not pass the encoding to POI.  Another indicator of 
reliability is the length of the text...perhaps we should take that into 
consideration in combination with the confidence score that CharsetDetector 
returns?  More heuristics...
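
A minimal sketch of what combining the two signals could look like.  This is 
not the code in the patch: the class and method names are hypothetical, the 
35 cutoff is the confidence threshold mentioned above, and the 50-byte minimum 
length is only a placeholder taken from the problem docs seen so far.

```java
/** Hypothetical gate that combines detector confidence with text length. */
final class DetectionGate {

    // Confidence cutoff mentioned in the comment above.
    static final int MIN_CONFIDENCE = 35;

    // Placeholder length cutoff (assumption, not a tuned value).
    static final int MIN_LENGTH_BYTES = 50;

    private DetectionGate() {
    }

    /**
     * Returns true only if the detection is confident enough AND the text
     * was long enough for the detector to have had a fair chance.
     */
    static boolean shouldUseDetectedEncoding(int confidence, int lengthBytes) {
        return confidence >= MIN_CONFIDENCE && lengthBytes >= MIN_LENGTH_BYTES;
    }

    public static void main(String[] args) {
        // Confident detection on a very short text is still rejected.
        System.out.println(shouldUseDetectedEncoding(60, 40));
        // Confident detection on a longer text is accepted.
        System.out.println(shouldUseDetectedEncoding(60, 200));
    }
}
```

Whether a short-text cutoff helps more than it hurts would need to be checked 
against a larger set of documents.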

I am hesitant about the solution in the patch, but it does prevent 
unsupported-encoding exceptions, and it reports (via metadata) that an 
unsupported encoding was detected.
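
For illustration, the "first supported encoding" idea can be sketched roughly 
as follows.  This is not the attached patch: the class, the Result holder, and 
the plain list of candidate names are assumptions made for the example (the 
names stand in for whatever the detector returns, in order of preference), 
and the only real API used is java.nio.charset.Charset.isSupported().

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: pick the first charset name the JVM supports. */
final class SupportedCharsetPicker {

    /** Chosen charset name (null if none) plus the names that were skipped. */
    static final class Result {
        final String chosen;
        final List<String> unsupported;

        Result(String chosen, List<String> unsupported) {
            this.chosen = chosen;
            this.unsupported = Collections.unmodifiableList(unsupported);
        }
    }

    private SupportedCharsetPicker() {
    }

    /** Charset.isSupported() throws on null or illegal names; treat those as unsupported. */
    static boolean isSupported(String name) {
        try {
            return name != null && Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /**
     * Walks the candidates in order and stops at the first supported one.
     * Unsupported names seen along the way are collected so the caller can
     * report them (for example, in the metadata).
     */
    static Result pick(List<String> candidates) {
        List<String> unsupported = new ArrayList<>();
        for (String name : candidates) {
            if (isSupported(name)) {
                return new Result(name, unsupported);
            }
            unsupported.add(name);
        }
        return new Result(null, unsupported);
    }

    public static void main(String[] args) {
        Result r = pick(Arrays.asList("IBM424_rtl", "UTF-8", "ISO-8859-1"));
        System.out.println("chosen=" + r.chosen + " unsupported=" + r.unsupported);
    }
}
```

If nothing is supported, chosen is null and the caller would simply not pass 
an encoding to POI, which matches what happens today for low-confidence 
detections.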

Feedback?




was (Author: [email protected]):
First draft of a patch.  The basic idea is to pick the first supported encoding 
and report unsupported encodings in the metadata.  

If the basic notion is acceptable, I'll want to add UNSUPPORTED_ENCODING to the 
metadata framework...somehow.

Another thought, which I don't much like...we currently have a heuristic for 
confidence.  A heuristic for length would have worked for my particular batch 
of problem docs.  The lengths were all < 50 bytes.  Should we add a length 
heuristic?

> CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG 
> files
> -------------------------------------------------------------------------------
>
>                 Key: TIKA-1236
>                 URL: https://issues.apache.org/jira/browse/TIKA-1236
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: TIKA-1236.patch
>
>
> When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries 
> to detect the encoding.  For a handful of files, the CharsetDetector returns 
> "IBM424_rtl" with a confidence above the threshold.  This encoding is then set 
> with MAPIMessage.set7BitEncoding().  When MAPI tries to use this encoding, it 
> finds that it is unsupported and throws an exception. 
> Full stacktrace:
> {noformat}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
> ...irrelevant test framework junk...
> Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
>       at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
>       at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
>       at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
>       at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
>       at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 26 more
> Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
>       at java.lang.StringCoding.decode(Unknown Source)
>       at java.lang.String.<init>(Unknown Source)
>       at java.lang.String.<init>(Unknown Source)
>       at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
>       ... 33 more
> {noformat}
> Unfortunately, I can't share the problematic documents, and I can't create a 
> synthetic document that triggers this issue.
> Two questions:
> 1)  Should CharsetDetector return an encoding that is not supported?
> 2)  If so, should we add a simple check before calling set7BitEncoding()?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
