[
https://issues.apache.org/jira/browse/TEXT-192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248415#comment-17248415
]
Cyril Parsons edited comment on TEXT-192 at 12/12/20, 5:33 PM:
---------------------------------------------------------------
Thanks! IntelliJ's Maven integration seemed to do the trick. That said, I'm not
sure why the build now fails with the following unit test error, given that I
haven't touched the class in question:
{code:java}
DateStringLookupTest.testDefault:46 » Parse Unparseable date: "12/12/20 12:21"
{code}
I've only altered StringEscapeUtils, EntityArrays, and the former's test class.
This is especially confusing because the failure doesn't seem to reproduce in
the GitHub checks (I think?): the pull request I made reports that all tests
passed.
> HTML unescape does not parse Windows-1252 correctly
> ---------------------------------------------------
>
> Key: TEXT-192
> URL: https://issues.apache.org/jira/browse/TEXT-192
> Project: Commons Text
> Issue Type: Bug
> Affects Versions: 1.8
> Environment: Java, macOS; should not be platform specific
> Reporter: Cyril Parsons
> Priority: Minor
>
> According to [https://en.wikipedia.org/wiki/Windows-1252#Character_set],
> Windows-1252 and ISO 8859-1 differ for code points 128 through 159. Numeric
> character references in that range are not decoded to their Windows-1252
> characters.
> A minimal reproducible example:
> {code:java}
> import java.util.stream.Collectors;
> import org.apache.commons.text.StringEscapeUtils;
> ...
> String w1252 = "&#126;&#151;&#161;";
> String output = StringEscapeUtils.unescapeHtml4(w1252);
> System.out.println(output);
> System.out.println(output.chars().mapToLong(Long::valueOf)
> .boxed().collect(Collectors.toList()));
> {code}
> The output is:
> {code:java}
> ~ ¡
> [126, 151, 161]
> {code}
> (A space is substituted above for U+0097, the Unicode control character "End
> Of Guarded Area".) The expected output is an em dash in that position. Code
> points just outside the 128 to 159 range, where Windows-1252 and ISO 8859-1
> agree, decode correctly. Looking at the source for how
> StringEscapeUtils.UNESCAPE_HTML4 works, I think (big question marks here) that
> an escape set needs to be added for Windows-1252?
>
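To make the mapping in the description concrete, here is a minimal sketch of how references in the 128 to 159 range could be rewritten to their Windows-1252 characters before unescaping. It is not the Commons Text implementation or a proposed patch: the class Windows1252References and its methods remap and rewriteReferences are hypothetical names, only decimal references are handled, and it assumes the JVM provides the "windows-1252" charset.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Commons Text.
class Windows1252References {
    // Assumption: the running JVM supports the "windows-1252" charset.
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");
    // Decimal numeric character references such as &#151; (hex forms are ignored).
    private static final Pattern DECIMAL_REFERENCE = Pattern.compile("&#(\\d{1,7});");

    /** Maps a code point in 128..159 through Windows-1252; others pass through. */
    static int remap(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x9F) {
            return codePoint;
        }
        String decoded = new String(new byte[] {(byte) codePoint}, WINDOWS_1252);
        int mapped = decoded.codePointAt(0);
        // Bytes with no Windows-1252 assignment may decode to U+FFFD; keep the input then.
        return mapped == 0xFFFD ? codePoint : mapped;
    }

    /** Rewrites each decimal reference so that it names the remapped code point. */
    static String rewriteReferences(String html) {
        Matcher matcher = DECIMAL_REFERENCE.matcher(html);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            int codePoint = Integer.parseInt(matcher.group(1));
            String replacement = "&#" + remap(codePoint) + ";";
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // The input from the issue description.
        System.out.println(rewriteReferences("&#126;&#151;&#161;"));
    }
}
```

With this sketch, 151 becomes 8212 (U+2014, the em dash), while 126 and 161 are left alone, so passing the rewritten string to StringEscapeUtils.unescapeHtml4 should then produce the expected em dash. A fix inside the library would more likely live in the escape tables themselves, as the description suggests.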
--
This message was sent by Atlassian Jira
(v8.3.4#803005)