Re: RFR: 8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char [v5]

Xueming Shen Tue, 15 Jul 2025 09:58:16 -0700

On Tue, 15 Jul 2025 15:11:07 GMT, Xueming Shen <sher...@openjdk.org> wrote:


>> Regex class should conform to **_Level 1_** of [Unicode Technical Standard 
>> #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/), 
>> plus RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
>> 
>> This PR primarily addresses conformance with RL1.5: Simple Loose Matches, 
>> which requires that simple case folding be applied to literals and 
>> (optionally) to character classes. When applied to character classes, each 
>> class is expected to be closed under simple case folding. See the standard 
>> for a detailed explanation of what it means for a class to be “**_closed_**.”
>> 
>> **RL1.5 states**: 
>> 
>> To meet this requirement, an implementation that supports case-sensitive 
>> matching should
>> 
>>     1. Provide at least the simple, default Unicode case-insensitive 
>> matching, and
>>     2. Specify which character properties or constructs are closed under the 
>> matching.
>> 
>> **In the Pattern implementation**, 5 types of constructs may be affected by 
>> case sensitivity:
>> 
>>     1. back-refs
>>     2. string slices (sequences)
>>     3. single characters,
>>     4. character families (Unicode Properties ...), and
>>     5. character class ranges
>> 
>> **Note**: Single characters and families may appear independently or within 
>> a character class.
>> 
>> For case-insensitive (loose) matching, the implementation already applies 
>> Character.toUpperCase() and Character.toLowerCase() to **both the pattern 
>> and the input string** for back-refs, slices, and single characters. This 
>> effectively makes these constructs closed under case folding.
>> 
>> This has been verified in the newly added test case 
>> **_test/jdk/java/util/regex/CaseFoldingTest.java_**.
>> 
>> For example:
>> 
>> Pattern.compile("(?ui)\u017f").matcher("S").matches()      => true
>> Pattern.compile("(?ui)[\u017f]").matcher("S").matches()    => true
>> 
>> The character properties (families) are not "closed" and should remain 
>> unchanged. This is acceptable per RL1.5, if the behavior is clearly 
>> specified (TBD: update javadoc to reflect this).
>> 
>> **Current Non-Conformance: Character Class Ranges**, as reported in the 
>> original bug report.
>> 
>> Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches()  => false
>> vs
>> Pattern.compile("(?ui)[S-S]").matcher("\u017f").matches()          => true
>> 
>> vs Perl. (Perl also claims to support Unicode loose matching with its 
>> "i" modifier.)
>> 
>> perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/ ? "true\n" : "false\n"'  => false
>> perl -C -e 'print "S" =~ /[\x{017f}-\x{0...
>
> Xueming Shen has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   improve the lookup logic and test case for +00df

Thanks for the reviews!
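
For anyone following the thread, here is a minimal sketch of the loose-matching idea from the quoted description: a range is treated as matching when some member of it folds to the same value as the input character. The class name and the helpers `foldApprox`, `looseEquals`, and `rangeMatchesLoosely` are hypothetical names made up for illustration. The upper-then-lower round trip only approximates simple case folding, and none of this is the actual Pattern lookup logic from the PR.

```java
import java.util.regex.Pattern;

public class LooseMatchSketch {

    // Assumption: approximate simple case folding with an
    // upper-then-lower round trip on a single code point.
    static int foldApprox(int cp) {
        return Character.toLowerCase(Character.toUpperCase(cp));
    }

    // Hypothetical helper: two code points match loosely
    // when they fold to the same value.
    static boolean looseEquals(int a, int b) {
        return foldApprox(a) == foldApprox(b);
    }

    // Hypothetical helper: does any code point in [lo, hi] loosely match cp?
    // A linear scan is fine for a sketch; a real matcher would avoid it.
    static boolean rangeMatchesLoosely(int lo, int hi, int cp) {
        for (int c = lo; c <= hi; c++) {
            if (looseEquals(c, cp)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+017F (LATIN SMALL LETTER LONG S) and 'S' fold to the same value.
        System.out.println(looseEquals(0x017F, 'S'));
        System.out.println(rangeMatchesLoosely(0x017F, 0x017F, 'S'));

        // The single-character form from the quoted description.
        System.out.println(
            Pattern.compile("(?ui)\u017f").matcher("S").matches());

        // The range form from the bug report. The result depends on
        // whether the JDK in use includes the fix under review.
        System.out.println(
            Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches());
    }
}
```

The last line of `main` is left without an expected value on purpose, since it is exactly the behavior this PR changes.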

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26285#issuecomment-3074413884
