[jira] [Commented] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

Tim Allison (JIRA) Fri, 01 Sep 2017 08:22:38 -0700

    [ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150712#comment-16150712 ]


Tim Allison commented on TIKA-2219:
-----------------------------------

Not sure if there's anything we can do here.

The attached file doesn't go through the charset detector at all because it is 
parsed by the RFC822Parser.  If you do force it to go through the TXTParser, it 
is correctly id'd as windows-1252.

IIRC, the correct way to encode windows-1252 in RFC822 should be something 
like {{=?windows-1252?Q?100_=80?=}}.
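
To make the encoded form concrete, here is a minimal sketch of decoding one such word by hand. It assumes the full RFC 2047 shape, {{=?charset?Q?text?=}}, with the leading {{=?}} and trailing {{?=}} delimiters. The class and method names are hypothetical, and this is not how the RFC822Parser does it; it only shows that the byte {{=80}} becomes the euro sign under windows-1252 and a C1 control character under ISO-8859-1.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Tika.
final class EncodedWordSketch {

    /** Decodes a single Q-encoded word of the form =?charset?Q?text?= */
    static String decodeQ(String word) {
        if (word.length() < 4 || !word.startsWith("=?") || !word.endsWith("?=")) {
            throw new IllegalArgumentException("not an encoded word: " + word);
        }
        // Drop the "=?" and "?=" delimiters, then split into charset, encoding, text.
        String[] parts = word.substring(2, word.length() - 2).split("\\?", 3);
        if (parts.length != 3 || !parts[1].equalsIgnoreCase("Q")) {
            throw new IllegalArgumentException("only Q encoding is handled: " + word);
        }
        Charset charset = Charset.forName(parts[0]);
        String text = parts[2];

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');                 // "_" stands for a space in Q encoding
            } else if (c == '=') {
                if (i + 2 >= text.length()) {
                    throw new IllegalArgumentException("truncated escape in: " + word);
                }
                // "=XX" is one raw byte given as two hex digits
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);                   // remaining characters are plain ASCII
            }
        }
        return new String(bytes.toByteArray(), charset);
    }

    public static void main(String[] args) {
        String asCp1252 = decodeQ("=?windows-1252?Q?100_=80?=");
        String asLatin1 = decodeQ("=?ISO-8859-1?Q?100_=80?=");
        // Print the code point of the last character of each result.
        System.out.println(Integer.toHexString(asCp1252.charAt(4))); // 20ac (euro sign)
        System.out.println(Integer.toHexString(asLatin1.charAt(4))); // 80 (C1 control)
    }
}
```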

Any recommended fix?

> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
>                 Key: TIKA-2219
>                 URL: https://issues.apache.org/jira/browse/TIKA-2219
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>             Fix For: 2.0, 1.15
>
>         Attachments: test.txt
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is 
> always detected instead.  While not tested, this likely affects other 
> windows-125* encodings as well.
> I tracked it down to a change in the 
> {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  Now it always 
> returns "ISO-8859-1" whereas before it was: {{return haveC1Bytes ? 
> "windows-1252" : "ISO-8859-1";}}
> Now that condition has been moved to the {{match(CharsetDetector det)}} 
> method so that the returned CharsetMatch has the proper name.  The problem 
> is that the {{CharsetDetector#detectAll()}} method overwrites the correct 
> match with a new one that returns the value of {{#getName()}} from the 
> {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in 
> the {{detectAll()}} method are replaced with new ones, but changing this 
> code in that method appears to work for me:
> // Remove this:
> //                    CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //                    matches.add(m);
> // Add this instead:
>                     matches.add(charsetMatch);
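
For readers following along, the behaviour described in the quoted report can be modelled in a few lines. This is a simplified sketch, not Tika's code: the classes {{Match}} and {{Latin1Recognizer}}, both {{detectAll*}} methods, and the fixed confidence value are hypothetical stand-ins. It also assumes that "C1 bytes" means any byte in the range 0x80 to 0x9F. It shows why rebuilding the match from the recognizer's {{getName()}} loses the windows-1252 name, while keeping the recognizer's own match preserves it.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the reported behaviour; NOT Tika's classes.
final class DetectAllSketch {

    /** A detection result that carries its own charset name. */
    static final class Match {
        final String name;
        final int confidence;

        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    /** Stand-in for the ISO-8859-1 recognizer described in the report. */
    static final class Latin1Recognizer {
        // After the change, the recognizer-level name is always ISO-8859-1.
        String getName() {
            return "ISO-8859-1";
        }

        // The C1 check now lives here, so the returned match has the right name.
        Match match(byte[] input) {
            String name = hasC1Bytes(input) ? "windows-1252" : "ISO-8859-1";
            return new Match(name, 50); // arbitrary confidence for the sketch
        }
    }

    /** Assumption: C1 bytes are those in 0x80..0x9F. */
    static boolean hasC1Bytes(byte[] input) {
        for (byte b : input) {
            int v = b & 0xFF;
            if (v >= 0x80 && v <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    /** Mirrors the reported problem: the match is rebuilt from getName(). */
    static List<Match> detectAllRebuilding(byte[] input) {
        Latin1Recognizer csr = new Latin1Recognizer();
        Match m = csr.match(input);
        List<Match> matches = new ArrayList<>();
        matches.add(new Match(csr.getName(), m.confidence));
        return matches;
    }

    /** Mirrors the reporter's suggested change: keep the recognizer's match. */
    static List<Match> detectAllKeeping(byte[] input) {
        Latin1Recognizer csr = new Latin1Recognizer();
        List<Match> matches = new ArrayList<>();
        matches.add(csr.match(input));
        return matches;
    }

    public static void main(String[] args) {
        byte[] sample = {'1', '0', '0', ' ', (byte) 0x80};
        System.out.println(detectAllRebuilding(sample).get(0).name); // ISO-8859-1
        System.out.println(detectAllKeeping(sample).get(0).name);    // windows-1252
    }
}
```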



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
