[jira] [Comment Edited] (TIKA-1236) CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG files

Tim Allison (JIRA) Tue, 11 Feb 2014 03:39:36 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897759#comment-13897759 ]


Tim Allison edited comment on TIKA-1236 at 2/11/14 11:38 AM:
-------------------------------------------------------------

I'll take a look at the history of OutlookExtractor today.

I agree that the user should have a way of knowing if an unsupported encoding 
was identified.
 
A few questions:
1) Does anyone know why OutlookExtractor uses CharsetDetector and not 
EncodingDetector or AutoDetectReader?  Is there still a use case for 
CharsetDetector instead of EncodingDetector?

2) Is the license at the header of CharsetDetector consistent with ASL2?

3) Given that HTMLEncodingDetector, Icu4JEncodingDetector and 
UniversalEncodingDetector check for whether a charset is supported before 
returning it, should we modify those to add "Unsupported-Encoding" or similar 
(hopefully something consistent with xmp/dc) to a document's metadata?

If the answer to 1) is that OutlookExtractor just hasn't yet been updated to 
AutoDetectReader and there is no known performance reason to keep 
CharsetDetector, then I propose:

a) updating OutlookExtractor to use AutoDetectReader
b) opening a new issue to modify the other three detectors to add metadata for 
unsupported encodings

Then, if a user wants Tika to fall back to a default encoding, s/he can add a 
dummy "AlwaysASCIIEncodingDetector" (or similar) to the end of the list of 
EncodingDetectors.  The client can also check the metadata to see whether a 
detector ranked an unsupported encoding above the one that was actually used.
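
To make the proposal concrete, here is a minimal sketch of how a detector 
chain could record unsupported guesses and still fall back to ASCII.  It is 
self-contained and does not use Tika's actual EncodingDetector interface: 
NameDetector, DetectorChainSketch, ALWAYS_ASCII and the "Unsupported-Encoding" 
key are all hypothetical names chosen for illustration, and a plain Map stands 
in for the document's metadata.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DetectorChainSketch {
    private DetectorChainSketch() {}

    /** Hypothetical stand-in for a detector: returns a charset name, or null for "no guess". */
    interface NameDetector {
        String detectName(byte[] data);
    }

    /** Hypothetical metadata key; the real name would need to be agreed on. */
    static final String UNSUPPORTED_KEY = "Unsupported-Encoding";

    /** Dummy last-resort detector in the spirit of "AlwaysASCIIEncodingDetector". */
    static final NameDetector ALWAYS_ASCII = data -> "US-ASCII";

    /** True only if the name is legal and the running JVM can decode with it. */
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    /**
     * Asks each detector in order. Unsupported guesses are appended to the
     * metadata instead of being returned; the first supported guess wins.
     * Returns null if no detector produced a supported charset.
     */
    static Charset detect(List<NameDetector> detectors, byte[] data,
                          Map<String, String> metadata) {
        for (NameDetector detector : detectors) {
            String name = detector.detectName(data);
            if (name == null) {
                continue;
            }
            if (isSupported(name)) {
                return Charset.forName(name);
            }
            metadata.merge(UNSUPPORTED_KEY, name, (old, add) -> old + "," + add);
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulates a detector that keeps guessing the unsupported name.
        NameDetector guessesIbm424 = data -> "IBM424_rtl";
        Map<String, String> metadata = new LinkedHashMap<>();
        Charset used = detect(Arrays.asList(guessesIbm424, ALWAYS_ASCII),
                              new byte[0], metadata);
        System.out.println("used=" + used + " metadata=" + metadata);
    }
}
```

With the dummy detector last in the list, the caller always gets a usable 
charset and can still see from the metadata that a higher-ranked guess was 
rejected.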



was (Author: [email protected]):
I'll take a look at the history of OutlookParser today.

I agree that the user should have a way of knowing if an unsupported encoding 
was identified.
 
A few questions:
1) Does anyone know why OutlookParser uses CharsetDetector and not 
EncodingDetector or AutoDetectReader?  Is there still a use case for 
CharsetDetector instead of EncodingDetector?
2) Is the license at the header of CharsetDetector consistent with ASL2?
3) Given that HTMLEncodingDetector, Icu4JEncodingDetector and 
UniversalEncodingDetector check for whether a charset is supported before 
returning it, should we modify those to add "Unsupported-Encoding" or similar 
(hopefully something consistent with xmp/dc) to a document's metadata?

If the answer to 1) is that OutlookParser just hasn't yet been updated to 
AutoDetectReader and there is no known performance reason to keep 
CharsetDetector, then I propose:

a) updating OutlookParser to use AutoDetectReader
b) opening a new issue to modify the other three detectors to add metadata for 
unsupported encodings

Then, if a user wants Tika to fallback to a default encoding, s/he can add a 
dummy "AlwaysASCIIEncodingDetector" (or similar) to the list of 
EncodingDetectors.  The client can also check the metadata for unsupported 
encodings that a detector ranked higher than the encoding that was used.  


> CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG 
> files
> -------------------------------------------------------------------------------
>
>                 Key: TIKA-1236
>                 URL: https://issues.apache.org/jira/browse/TIKA-1236
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Tim Allison
>            Priority: Minor
>
> When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries 
> to detect the encoding.  For a handful of files, the CharsetDetector returns 
> "IBM424_rtl" with a confidence above the threshold.  This encoding is then 
> set with MAPIMessage.set7BitEncoding().  When MAPIMessage tries to use this 
> encoding, it finds that it is unsupported and throws an exception. 
> Full stacktrace:
> {noformat}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
> ...irrelevant test framework junk...
> Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
>       at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
>       at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
>       at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
>       at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
>       at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 26 more
> Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
>       at java.lang.StringCoding.decode(Unknown Source)
>       at java.lang.String.<init>(Unknown Source)
>       at java.lang.String.<init>(Unknown Source)
>       at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
>       ... 33 more
> {noformat}
> Unfortunately, I can't share the problematic documents, and I can't create a 
> synthetic document that triggers this issue.
> Two questions:
> 1)  Should CharsetDetector return an encoding that is not supported?
> 2)  If so, should we add a simple check before calling set7BitEncoding()?
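
For question 2, the "simple check" could look roughly like the sketch below.  
It only uses java.nio.charset from the JDK; CharsetGuard, isUsable and 
orFallback are hypothetical helper names, and falling back to US-ASCII is an 
assumption rather than a decided behavior.  The caller would pass the 
detector's charset name through the guard before handing it to 
set7BitEncoding().

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class CharsetGuard {
    private CharsetGuard() {}

    /** True only if the name is non-null, syntactically legal, and supported by this JVM. */
    static boolean isUsable(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // Names with illegal characters (or an empty name) are treated as unusable.
            return false;
        }
    }

    /** Returns the detected charset if the JVM supports it, otherwise the fallback. */
    static Charset orFallback(String detected, Charset fallback) {
        return isUsable(detected) ? Charset.forName(detected) : fallback;
    }

    public static void main(String[] args) {
        // "IBM424_rtl" is the name from the stack trace above.
        System.out.println(orFallback("IBM424_rtl", StandardCharsets.US_ASCII));
        System.out.println(orFallback("UTF-8", StandardCharsets.US_ASCII));
    }
}
```

This keeps the decision about what to do with an unsupported name in one 
place, so the exception never has to surface from the 7-bit decoding step.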



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
