Re: RFR: 8354968: Replace unicode sequences in comment text with UTF-8 characters [v2]

Naoto Sato Tue, 06 May 2025 10:21:49 -0700

On Tue, 6 May 2025 15:46:03 GMT, Magnus Ihse Bursie <i...@openjdk.org> wrote:


>> As part of the UTF-8 cleaning up done in 
>> [JDK-8301971](https://bugs.openjdk.org/browse/JDK-8301971), I looked at 
>> where and how we are using unicode sequences (`\uXXXX`). In several string 
>> literals, I think the unicode sequences still have merit, if they improve 
>> clarity or readability of the code. Some instances are more gray zone. But 
>> the places where it does not make sense at all are in comments, as part of 
>> fluid text comments. There they are just disruptive and not helpful at all. 
>> I tried to locate all such places (but I might have missed places, I did not 
>> do a proper lexical analysis to find comments) and fix them.
>> 
>> 99% of this fix is to turn poor `Peter von der Ah\u00e9` into `Peter von der 
>> Ahé`. 😆 
>> 
>> I checked some random samples on when this was introduced to see if there 
>> was some particular commit that mistreated the encoding, but they have been 
>> there since the original release of the open JDK source code.
>> 
>> There are likely many more places where direct UTF-8 encoded characters are 
>> preferable to unicode sequences, but this seemed like a safe and trivial 
>> first start.
>
> Magnus Ihse Bursie has updated the pull request with a new target base due to 
> a merge or a rebase. The incremental webrev excludes the unrelated changes 
> brought in by the merge/rebase. The pull request contains two additional 
> commits since the last revision:
> 
>  - Merge branch 'master' into unicode-sequence-in-comments
>  - 8354968: Replace unicode sequences in comment text with UTF-8 characters

src/java.base/share/classes/java/text/Collator.java line 141:

> 139:      * considered significant during comparison. The assignment of strengths
> 140:      * to language features is locale dependent. A common example is for
> 141:      * different accented forms of the same base letter ("a" vs "ä") to be

Since this (and the other one in RuleBasedCollator) is part of an explanation 
of text handling, I think keeping the original code point makes sense. So I'd 
have both the UTF-8 character and its Unicode escape notation here.
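
To illustrate why the exact code point matters in this comment, here is a minimal sketch (not part of the PR; the class name `AccentStrengthDemo` and the helper `compareAccented` are hypothetical). It compares "a" with "ä" at two collation strengths, assuming the default US collation rules shipped with the JDK. The string literal uses the escape form so the example does not depend on the source file encoding, and the comment spells out both forms, as suggested above.

```java
import java.text.Collator;
import java.util.Locale;

public class AccentStrengthDemo {

    // Compares "a" with "ä" (\u00e4, a-diaeresis) at the given strength.
    // Assumption: Locale.US uses the JDK's default collation rules.
    static int compareAccented(int strength) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare("a", "\u00e4");
    }

    public static void main(String[] args) {
        // At PRIMARY strength only base letters are significant.
        System.out.println("PRIMARY:   " + compareAccented(Collator.PRIMARY));
        // At SECONDARY strength accent differences become significant too.
        System.out.println("SECONDARY: " + compareAccented(Collator.SECONDARY));
    }
}
```

Under those assumptions, the two strings should compare as equal at `PRIMARY` strength and as different at `SECONDARY` strength, which is the locale-dependent behavior the Collator comment describes.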

src/java.base/share/classes/java/text/RuleBasedCollator.java line 594:

> 592:         // a three-digit number, one digit for primary, one for secondary, etc.
> 593:         //
> 594:         // String:              A     a     B   é

Maybe "é (\u00e9, e-acute)"?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/24727#discussion_r2075933987
PR Review Comment: https://git.openjdk.org/jdk/pull/24727#discussion_r2075935811
