Re: RFR: 8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char

Xueming Shen Mon, 14 Jul 2025 00:31:39 -0700

On Mon, 14 Jul 2025 05:01:17 GMT, Chen Liang <li...@openjdk.org> wrote:


>> Regex class should conform to **_Level 1_** of [Unicode Technical Standard 
>> #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/), 
>> plus RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
>> 
>> This PR primarily addresses conformance with RL1.5: Simple Loose Matches, 
>> which requires that simple case folding be applied to literals and 
>> (optionally) to character classes. When applied to character classes, each 
>> class is expected to be closed under simple case folding. See the standard 
>> for a detailed explanation of what it means for a class to be “closed.”
>> 
>> **RL1.5 states**: 
>> 
>> To meet this requirement, an implementation that supports case-insensitive 
>> matching should
>> 
>>     1. Provide at least the simple, default Unicode case-insensitive 
>> matching, and
>>     2. Specify which character properties or constructs are closed under the 
>> matching.
>> 
>> **In the Pattern implementation**, 5 types of constructs may be affected by 
>> case sensitivity:
>> 
>>     1. back-refs,
>>     2. string slices (sequences),
>>     3. single characters,
>>     4. character families (Unicode Properties ...), and
>>     5. character class ranges
>> 
>> **Note**: Single characters and families may appear independently or within 
>> a character class.
>> 
>> For case-insensitive (loose) matching, the implementation already applies 
>> Character.toUpperCase() and Character.toLowerCase() to **both the pattern 
>> and the input string** for back-refs, slices, and single characters. This 
>> effectively makes these constructs closed under case folding.
>> 
>> This has been verified in the newly added test case 
>> **_test/jdk/java/util/regex/CaseFoldingTest.java_**.
>> 
>> For example:
>> 
>> Pattern.compile("(?ui)\u017f").matcher("S").matches()      => true
>> Pattern.compile("(?ui)[\u017f]").matcher("S").matches()    => true
>> 
>> The character properties (families) are not "closed" and should remain 
>> unchanged. This is acceptable per RL1.5, if the behavior is clearly 
>> specified (TBD: update javadoc to reflect this).
>> 
>> **Current Non-Conformance: Character Class Ranges**, as reported in the 
>> original bug report.
>> 
>> Pattern.compile("(?ui)[\u017f-\u...
>
> src/java.base/share/classes/jdk/internal/util/regex/CaseFolding.java.template 
> line 99:
> 
>> 97:      */
>> 98:     public static int[] getClassRangeClosingCharacters(int start, int 
>> end) {
>> 99:         int[] expanded = new int[expanded_casefolding.size()];
> 
> Can be `Math.min(expanded_casefolding.size(), end - start)` in case the table 
> grows large, and update the `off < expanded.length` check below too.

The table itself probably isn't going to grow significantly anytime soon, and 
we’ll likely have enough time to adjust if CaseFolding.txt does get 
substantially bigger.

That said, I should probably consider reversing the lookup logic: instead of 
iterating through [start, end], we could iterate over the expansion table and 
check whether any of its code points fall within the input range, at least when 
the range is larger than the table. That replaces a scan that is O(n) in the 
range size with one bounded by the table size, which is effectively constant.
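
A rough sketch of the two lookup directions, to make the trade-off concrete. 
This is not the code in the PR: the class name, the method names, and the 
two-entry `TABLE` are all hypothetical stand-ins, and the real expansion table 
is generated from CaseFolding.txt and may be shaped differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RangeClosureSketch {
    // Hypothetical stand-in for the expansion table: code point -> code points
    // it should also match under case-insensitive matching. Hand-picked
    // entries only, not the real generated data.
    static final TreeMap<Integer, int[]> TABLE = new TreeMap<>();
    static {
        TABLE.put(0x017F, new int[] {'S', 's'}); // LATIN SMALL LETTER LONG S
        TABLE.put(0x212A, new int[] {'K', 'k'}); // KELVIN SIGN
    }

    // Direction 1: walk every code point in [start, end].
    // Cost grows with the size of the range.
    static int[] byRangeScan(int start, int end) {
        List<Integer> out = new ArrayList<>();
        for (int cp = start; cp <= end; cp++) {
            int[] folds = TABLE.get(cp);
            if (folds != null) {
                for (int f : folds) out.add(f);
            }
        }
        return toArray(out);
    }

    // Direction 2: walk the table and keep entries whose key is in range.
    // Cost is bounded by the size of the table, whatever the range is.
    static int[] byTableScan(int start, int end) {
        List<Integer> out = new ArrayList<>();
        for (Map.Entry<Integer, int[]> e : TABLE.entrySet()) {
            int cp = e.getKey();
            if (cp >= start && cp <= end) {
                for (int f : e.getValue()) out.add(f);
            }
        }
        return toArray(out);
    }

    // Pick whichever direction visits fewer elements.
    static int[] closingCharacters(int start, int end) {
        long rangeSize = (long) end - start + 1;
        return rangeSize > TABLE.size()
                ? byTableScan(start, end)
                : byRangeScan(start, end);
    }

    private static int[] toArray(List<Integer> list) {
        return list.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

With a sorted map both directions visit matching entries in ascending 
code point order, so they return the same array and the choice between them 
affects only the cost.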

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26285#discussion_r2204044141
