Regex class should conform to **_Level 1_** of [Unicode Technical Standard #18: 
Unicode Regular Expressions](http://www.unicode.org/reports/tr18/), plus RL2.1 
Canonical Equivalents and RL2.2 Extended Grapheme Clusters.

This PR primarily addresses conformance with RL1.5: Simple Loose Matches, which 
requires that simple case folding be applied to literals and (optionally) to 
character classes. When applied to character classes, each class is expected to 
be closed under simple case folding. See the standard for a detailed 
explanation of what it means for a class to be “closed.”

**RL1.5 states**: 

To meet this requirement, an implementation that supports case-sensitive 
matching should

    1. Provide at least the simple, default Unicode case-insensitive matching, and
    2. Specify which character properties or constructs are closed under the matching.

**In the Pattern implementation**, five types of constructs may be affected by 
case sensitivity:

    1. back-references
    2. string slices (sequences)
    3. single characters
    4. character families (Unicode Properties ...), and
    5. character class ranges

**Note**: Single characters and families may appear independently or within a 
character class.

For case-insensitive (loose) matching, the implementation already applies 
Character.toUpperCase() and Character.toLowerCase() to **both the pattern and 
the input string** for back-refs, slices, and single characters. This 
effectively makes these constructs closed under case folding.

This has been verified in the newly added test case 
**_test/jdk/java/util/regex/CaseFoldingTest.java_**.

For example:

Pattern.compile("(?ui)\u017f").matcher("S").matches()      => true
Pattern.compile("(?ui)[\u017f]").matcher("S").matches()    => true
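
As a rough illustration of why these constructs behave as closed, the comparison can be sketched as follows. This is a minimal sketch, not the actual Pattern code: the class name `LooseSingleSketch` and the helper `looseEquals` are hypothetical, and only `Character.toUpperCase(int)` and `Character.toLowerCase(int)` are real JDK calls.

```java
public class LooseSingleSketch {
    // Hypothetical helper: two code points match loosely when they agree
    // after upper-casing, or after lower-casing the upper-cased values.
    static boolean looseEquals(int a, int b) {
        if (a == b) {
            return true;
        }
        int ua = Character.toUpperCase(a);
        int ub = Character.toUpperCase(b);
        return ua == ub
            || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    public static void main(String[] args) {
        // U+017F LATIN SMALL LETTER LONG S upper-cases to 'S'.
        System.out.println(looseEquals(0x017f, 'S')); // true
        System.out.println(looseEquals(0x017f, 's')); // true
        System.out.println(looseEquals(0x017f, 'T')); // false
    }
}
```

Because the conversion is applied to both sides, the order of pattern and input does not matter for these constructs.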

Character properties (families) are not "closed" and should remain unchanged. 
This is acceptable per RL1.5 as long as the behavior is clearly specified 
(TBD: update javadoc to reflect this).

**Current Non-Conformance: Character Class Ranges**, as reported in the 
original bug report.

Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches()  => false
vs
Pattern.compile("(?ui)[S-S]").matcher("\u017f").matches()          => true

vs. Perl (Perl also claims to support Unicode loose matching with its "i" 
modifier):

perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/ ? "true\n" : "false\n"'   => false
perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/i ? "true\n" : "false\n"'  => **_true_**

The root issue is that the range construct is not implemented to be closed 
under simple case folding. Applying toUpperCase() and toLowerCase() to the 
endpoints of a range like [\u017f-\u0180] does not produce a meaningful or 
valid range for case-folding comparisons. For example, uppercase conversion 
turns [\u017f-\u0180] into [\u0053-\u0243].
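
To make the endpoint problem concrete, here is a minimal sketch, assuming a brute-force notion of closure: an input code point matches a range if it is loosely equal to any member of that range. The class `RangeClosureSketch` and its helpers are hypothetical names for illustration; this is not how the PR implements range closure.

```java
public class RangeClosureSketch {
    // Hypothetical helper: loose equality via upper/lower conversions.
    static boolean looseEquals(int a, int b) {
        if (a == b) {
            return true;
        }
        int ua = Character.toUpperCase(a);
        int ub = Character.toUpperCase(b);
        return ua == ub
            || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    // Brute-force closure: ch matches if it is loosely equal to ANY member
    // of [lo, hi]. Linear in the range size, so illustrative only.
    static boolean inRangeLoose(int lo, int hi, int ch) {
        for (int c = lo; c <= hi; c++) {
            if (looseEquals(c, ch)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int lo = 0x017f;
        int hi = 0x0180;
        // Converting only the endpoints yields a different, much wider range.
        System.out.printf("[%04x-%04x] -> [%04x-%04x]%n", lo, hi,
                Character.toUpperCase(lo), Character.toUpperCase(hi));

        // Checking members individually gives symmetric results.
        System.out.println(inRangeLoose(0x017f, 0x017f, 'S')); // true
        System.out.println(inRangeLoose('S', 'S', 0x017f));    // true
        System.out.println(inRangeLoose(0x017f, 0x017f, 'T')); // false
    }
}
```

A real implementation would more plausibly compute the closed set once at compile time instead of scanning the range for every input character; that is a design assumption, not a description of this change.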

**What This PR Does**

This PR adds support for ensuring that character class ranges are closed under 
simple case folding when the (?ui) (Unicode case-insensitive) flag is used, 
bringing Pattern into better conformance with UTS #18 Level 1 (RL1.5).

**Notes**

**(1) The PR also tries to fix a special corner case for U+00DF.** 
See https://codepoints.net/U+00DF vs. https://codepoints.net/U+1E9E?lang=en 
for more context.

Pattern.compile("(?ui)\u00df").matcher("\u1e9e").matches() => false
Pattern.compile("(?ui)\u1e9e").matcher("\u00df").matches() => false

vs

perl -C -e 'print "\x{1e9e}" =~ /\x{df}/ ? "true\n" : "false\n"'  => false
perl -C -e 'print "\x{df}" =~ /\x{1e9e}/ ? "true\n" : "false\n"'  => false
perl -C -e 'print "\x{1e9e}" =~ /\x{df}/i ? "true\n" : "false\n"'  => true
perl -C -e 'print "\x{df}" =~ /\x{1e9e}/i ? "true\n" : "false\n"'  => true

Java's Character class still CORRECTLY returns U+00DF as the uppercase of 
U+00DF, as specified by Unicode. As a result, the toUpperCase() != 
toLowerCase() check in the single() implementation fails to pick SingleU for 
case-insensitive matching as expected.

Integer.toHexString(Character.toUpperCase('\u00df')) => df
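
The following sketch shows the conversions involved. It assumes the check has the shape "upper-case the pattern character, lower-case that result, and compare the two"; the class `SharpSSketch` and the method `hasDistinctCases` are hypothetical stand-ins, not the actual code in single().

```java
public class SharpSSketch {
    // Hypothetical stand-in for the upper != lower test described above.
    static boolean hasDistinctCases(int ch) {
        int upper = Character.toUpperCase(ch);
        int lower = Character.toLowerCase(upper);
        return upper != lower;
    }

    public static void main(String[] args) {
        int sharpS = 0x00df;    // LATIN SMALL LETTER SHARP S
        int capSharpS = 0x1e9e; // LATIN CAPITAL LETTER SHARP S

        // U+00DF has no simple uppercase mapping, so it maps to itself.
        System.out.println(Integer.toHexString(Character.toUpperCase(sharpS)));    // df
        // U+1E9E does lower-case to U+00DF, so the relation is one-directional.
        System.out.println(Integer.toHexString(Character.toLowerCase(capSharpS))); // df

        System.out.println(hasDistinctCases(sharpS));    // false
        System.out.println(hasDistinctCases(capSharpS)); // false
        System.out.println(hasDistinctCases('s'));       // true
    }
}
```

Starting from U+00DF alone, upper-casing never reaches U+1E9E, which is why this character needs special handling.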

**(2) Known limitations: 3 'S'-like characters still fail**

There are 3 characters whose case folding mappings (per CaseFolding.txt) are 
not captured by our current logic, which relies only on Java's 
toUpperCase()/toLowerCase() conversions. These characters cannot be matched 
across constructs like back-ref, slice, single, or range using the current API. 
We will leave them unchanged for now, pending a possible migration to a pure 
case folding based matching implementation.

1FD3; S; 0390; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
1FE3; S; 03B0; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
FB05; S; FB06; # LATIN SMALL LIGATURE LONG S T
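
A small sketch of the gap, under stated assumptions: `SIMPLE_FOLD` holds only the three pairs listed above (a real folding table would be derived from the full CaseFolding.txt), and both comparison helpers are hypothetical names rather than Pattern internals.

```java
import java.util.Map;

public class FoldingGapSketch {
    // Only the three simple-folding pairs quoted above; deliberately incomplete.
    static final Map<Integer, Integer> SIMPLE_FOLD = Map.of(
            0x1fd3, 0x0390,
            0x1fe3, 0x03b0,
            0xfb05, 0xfb06);

    // Hypothetical: comparison that relies only on upper/lower conversions.
    static boolean matchesViaCaseConversion(int a, int b) {
        int ua = Character.toUpperCase(a);
        int ub = Character.toUpperCase(b);
        return ua == ub
            || Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    // Hypothetical: fold through the table, falling back to identity.
    static int fold(int cp) {
        return SIMPLE_FOLD.getOrDefault(cp, cp);
    }

    static boolean matchesViaFolding(int a, int b) {
        return fold(a) == fold(b);
    }

    public static void main(String[] args) {
        SIMPLE_FOLD.forEach((from, to) -> System.out.printf(
                "U+%04X vs U+%04X: conversion=%b folding=%b%n",
                from, to,
                matchesViaCaseConversion(from, to),
                matchesViaFolding(from, to)));
    }
}
```

The table lookup pairs each character with its folded form, while the conversion-based check has no mapping that connects them.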

**Refs**:
https://bugs.openjdk.org/browse/JDK-6486934
https://bugs.openjdk.org/browse/CCC-6486934
https://cr.openjdk.org/~sherman/6486934_6233084_6504326_6436458/

We are fixing an almost 20-year-old bug :-)

-------------

Commit messages:
 - 8360459: UNICODE_CASE and character class with non-ASCII range does not 
match ASCII char

Changes: https://git.openjdk.org/jdk/pull/26285/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26285&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8360459
  Stats: 2044 lines in 8 files changed: 2040 ins; 0 del; 4 mod
  Patch: https://git.openjdk.org/jdk/pull/26285.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/26285/head:pull/26285

PR: https://git.openjdk.org/jdk/pull/26285
