[jira] [Comment Edited] (TEXT-192) HTML unescape does not parse Windows-1252 correctly

Cyril Parsons (Jira) Sat, 12 Dec 2020 06:37:07 -0800


    [ 
https://issues.apache.org/jira/browse/TEXT-192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248363#comment-17248363
 ]


Cyril Parsons edited comment on TEXT-192 at 12/12/20, 2:36 PM:
---------------------------------------------------------------

Thanks, I coded up some new methods and mappings for CP-1252. See 
[https://github.com/apache/commons-text/pull/190]. That said, I have no idea 
how to fix the unit tests, which are failing the build: one unit test counts 
the `new HashMap<>()` calls and `initialMap.put` calls, and the way the new 
map is constructed doesn't mesh with that.

I guess I could fix it by refactoring the entire ISO 8859-1 section into 
CP-1252, but I'm not that bold.

(I also have no idea how to build the project locally at all.)
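
To illustrate the remapping discussed above, here is a minimal sketch of how a 
numeric reference in the 128-159 range could be reinterpreted as Windows-1252. 
It is not the code from the pull request: the class name 
Cp1252NumericEntities and the method remap are hypothetical, and it leans on 
the JDK's built-in "windows-1252" charset rather than a hand-written lookup 
map like the ones in Commons Text.

```java
import java.nio.charset.Charset;

public class Cp1252NumericEntities {

    // Assumption: the JDK's built-in windows-1252 charset stands in for an
    // explicit escape map.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Hypothetical helper: returns the code point that a Windows-1252 aware
     * decoder might produce for the numeric reference value given.
     */
    public static int remap(int codePoint) {
        // Only 128..159 differ between Windows-1252 and ISO 8859-1.
        if (codePoint < 0x80 || codePoint > 0x9F) {
            return codePoint;
        }
        String decoded = new String(new byte[] {(byte) codePoint}, CP1252);
        char c = decoded.charAt(0);
        // A few bytes in this range are unassigned in Windows-1252; if the
        // decoder reports the replacement character, keep the input value.
        return c == 0xFFFD ? codePoint : c;
    }

    public static void main(String[] args) {
        // Same three values as in the issue description.
        for (int cp : new int[] {126, 151, 161}) {
            System.out.printf("&#%d; -> U+%04X%n", cp, remap(cp));
        }
        // prints:
        // &#126; -> U+007E
        // &#151; -> U+2014
        // &#161; -> U+00A1
    }
}
```

With this approach 126 and 161 pass through unchanged and only 151 is 
remapped, which matches the expectation in the issue that values outside the 
128-159 range are already fine.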


was (Author: cpars509):
Thanks, I coded up some new methods and mappings for CP-1252. See 
[https://github.com/apache/commons-text/pull/190]. That said, I have no idea 
how to fix the unit tests which are throwing build errors.

(Or for that matter, how to build the project at all.)

> HTML unescape does not parse Windows-1252 correctly
> ---------------------------------------------------
>
>                 Key: TEXT-192
>                 URL: https://issues.apache.org/jira/browse/TEXT-192
>             Project: Commons Text
>          Issue Type: Bug
>    Affects Versions: 1.8
>         Environment: Java, macOS; should not be platform specific
>            Reporter: Cyril Parsons
>            Priority: Minor
>
> Looking at [https://en.wikipedia.org/wiki/Windows-1252#Character_set] there 
> are differences in parsing Windows-1252 and ISO 8859-1. Code points between 
> 128 and 159 (on Windows-1252) are improperly decoded.
> In a minimal example:
> {code:java}
> import java.util.stream.Collectors;
> import org.apache.commons.text.StringEscapeUtils;
> ...
> String w1252 = "&#126;&#151;&#161;";
> String output = StringEscapeUtils.unescapeHtml4(w1252);
> System.out.println(output);
> // Print the decoded code points to show what each reference became.
> System.out.println(output.chars().boxed().collect(Collectors.toList()));
> {code}
> The output is:
> {code:java}
> ~ ¡
> [126, 151, 161]
> {code}
> (Space substituted for the Unicode character "End Of Guarded Area", U+0097.) 
> The expected output would show an em dash (U+2014) in its place. Code points 
> just outside the range where Windows-1252 and ISO 8859-1 disagree are all 
> okay. Looking at the source for how StringEscapeUtils.UNESCAPE_HTML4 works, I 
> think (big question marks here) that an escape set needs to be added for 
> Windows-1252?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
