Re: RFR: 8360459: UNICODE_CASE and character class with non-ASCII range does not match ASCII char [v3]

Xueming Shen Mon, 14 Jul 2025 13:14:14 -0700

> Regex class should conform to **_Level 1_** of [Unicode Technical Standard 
> #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/), plus 
> RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
> 
> This PR primarily addresses conformance with RL1.5: Simple Loose Matches, 
> which requires that simple case folding be applied to literals and 
> (optionally) to character classes. When applied to character classes, each 
> class is expected to be closed under simple case folding. See the standard 
> for a detailed explanation of what it means for a class to be “**_closed_**.”
> 
> **RL1.5 states**: 
> 
> To meet this requirement, an implementation that supports case-insensitive 
> matching should
> 
>     1. Provide at least the simple, default Unicode case-insensitive 
> matching, and
>     2. Specify which character properties or constructs are closed under the 
> matching.
> 
> **In the Pattern implementation**, five types of constructs may be affected 
> by case-insensitive matching:
> 
>     1. back-references
>     2. string slices (sequences)
>     3. single characters
>     4. character families (Unicode properties ...)
>     5. character class ranges
> 
> **Note**: Single characters and families may appear independently or within a 
> character class.
> 
> For case-insensitive (loose) matching, the implementation already applies 
> Character.toUpperCase() and Character.toLowerCase() to **both the pattern and 
> the input string** for back-refs, slices, and single characters. This 
> effectively makes these constructs closed under case folding.
> 
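
The symmetric folding described above can be illustrated with a minimal sketch. It is not the actual java.util.regex.Pattern code; the class name LooseMatchSketch and the helper looseEquals are hypothetical, and the sketch only assumes that both sides are passed through Character.toUpperCase() and Character.toLowerCase() before comparison.

```java
public class LooseMatchSketch {
    // Hypothetical helper, not taken from Pattern: two code points are
    // treated as loosely equal if they agree after upper-casing, or after
    // upper-casing followed by lower-casing.
    static boolean looseEquals(int a, int b) {
        if (a == b) {
            return true;
        }
        int ua = Character.toUpperCase(a);
        int ub = Character.toUpperCase(b);
        return ua == ub
            || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    public static void main(String[] args) {
        // U+017F (long s) upper-cases to 'S'.
        System.out.println(looseEquals(0x017F, 'S'));  // prints "true"
        // U+212A (Kelvin sign) lower-cases to 'k', as does 'K'.
        System.out.println(looseEquals('K', 0x212A));  // prints "true"
        System.out.println(looseEquals('a', 'b'));     // prints "false"
    }
}
```

Because the same mapping is applied to the pattern side and the input side, the comparison is symmetric, which is what makes single characters, slices, and back-refs behave as closed under case folding.
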
> This has been verified in the newly added test case 
> **_test/jdk/java/util/regex/CaseFoldingTest.java_**.
> 
> For example:
> 
> Pattern.compile("(?ui)\u017f").matcher("S").matches()      => true
> Pattern.compile("(?ui)[\u017f]").matcher("S").matches()    => true
> 
> Character properties (families) are not "closed" and should remain 
> unchanged. This is acceptable per RL1.5 if the behavior is clearly 
> specified (TBD: update javadoc to reflect this).
> 
> **Current Non-Conformance: Character Class Ranges**, as reported in the 
> original bug report.
> 
> Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches()  => false
> vs
> Pattern.compile("(?ui)[S-S]").matcher("\u017f").matches()        => true
> 
> Compare with Perl, which also claims to support Unicode loose matching 
> through its "i" modifier:
> 
> perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/ ? "true\n" : "false\n"'
>     => false
> perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/i ? "true\n" : "false\n"'
>     => **_true_**
> 
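
The Java side of the comparison above can be reproduced with a small standalone program. This is only a sketch: the class name RangeFoldingDemo and the helper matches are made up for illustration, and the result of the second call depends on whether the JDK build being run includes this change, so no expected value is given for it.

```java
import java.util.regex.Pattern;

public class RangeFoldingDemo {
    // Hypothetical convenience wrapper around Pattern/Matcher.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // ASCII range, non-ASCII input: reported as true in the description.
        System.out.println(matches("(?ui)[S-S]", "\u017f"));
        // Non-ASCII range, ASCII input: reported as false before this change.
        System.out.println(matches("(?ui)[\u017f-\u017f]", "S"));
        // Without (?ui) the match is case-sensitive, so this prints "false".
        System.out.println(matches("[\u017f-\u017f]", "S"));
    }
}
```
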
> The root issue is that the ran...


Xueming Shen has updated the pull request incrementally with one additional 
commit since the last revision:

  update to address the review comments

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/26285/files
  - new: https://git.openjdk.org/jdk/pull/26285/files/735bd722..e18d2668

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=26285&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26285&range=01-02

  Stats: 11 lines in 3 files changed: 0 ins; 4 del; 7 mod
  Patch: https://git.openjdk.org/jdk/pull/26285.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/26285/head:pull/26285

PR: https://git.openjdk.org/jdk/pull/26285
