> Regex class should conform to **_Level 1_** of
> [Unicode Technical Standard #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/),
> plus RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
>
> This PR primarily addresses conformance with RL1.5: Simple Loose Matches,
> which requires that simple case folding be applied to literals and
> (optionally) to character classes. When applied to character classes, each
> class is expected to be closed under simple case folding. See the standard
> for a detailed explanation of what it means for a class to be “**_closed_**.”
>
> **RL1.5 states**:
>
> To meet this requirement, an implementation that supports case-sensitive
> matching should
>
>     1. Provide at least the simple, default Unicode case-insensitive
>        matching, and
>     2. Specify which character properties or constructs are closed under the
>        matching.
>
> **In the Pattern implementation**, 5 types of constructs may be affected by
> case sensitivity:
>
>     1. back-refs
>     2. string slices (sequences)
>     3. single character,
>     4. character families (Unicode Properties ...), and
>     5. character class ranges
>
> **Note**: Single characters and families may appear independently or within a
> character class.
>
> For case-insensitive (loose) matching, the implementation already applies
> Character.toUpperCase() and Character.toLowerCase() to **both the pattern and
> the input string** for back-refs, slices, and single characters. This
> effectively makes these constructs closed under case folding.
>
> This has been verified in the newly added test case
> **_test/jdk/java/util/regex/CaseFoldingTest.java_**.
>
> For example:
>
>     Pattern.compile("(?ui)\u017f").matcher("S").matches()   => true
>     Pattern.compile("(?ui)[\u017f]").matcher("S").matches() => true
>
> The character properties (families) are not "closed" and should remain
> unchanged. This is acceptable per RL1.5, if the behavior is clearly
> specified (TBD: update javadoc to reflect this).
>
> **Current Non-Conformance: Character Class Ranges**, as reported in the
> original bug report.
>
>     Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches() => false
>     vs
>     Pattern.compile("(?ui)[S-S]").matcher("\u017f").matches()      => true
>
> Compare with Perl, which also claims to support Unicode loose matching with
> its "i" modifier:
>
>     perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/ ? "true\n" : "false\n"'  => false
>     perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/i ? "true\n" : "false\n"' => true
>
> The root issue is that the ran...
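
The folding behavior the quoted description attributes to back-refs, slices, and single characters can be sketched in a few lines of Java. This is only an illustration, not the code in the PR: `CaseFoldSketch`, `simpleFold`, and `looseEquals` are hypothetical names, and the sketch assumes that lowercasing the uppercased code point is an adequate stand-in for simple case folding. The last line prints the range case from the bug report; its result depends on whether the JDK in use includes this change, so no expected value is given for it.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the JDK or of this PR.
final class CaseFoldSketch {

    // Assumption: approximate simple case folding by uppercasing, then
    // lowercasing, as the description says Pattern does for single characters.
    public static int simpleFold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    // Two code points match loosely when they fold to the same value.
    public static boolean looseEquals(int a, int b) {
        return simpleFold(a) == simpleFold(b);
    }

    public static void main(String[] args) {
        int longS = 0x017F; // LATIN SMALL LETTER LONG S

        // Long s uppercases to 'S', which lowercases to 's'.
        System.out.println(looseEquals(longS, 'S'));

        // Single-character case from the description (reported as true).
        System.out.println(
            Pattern.compile("(?ui)\u017f").matcher("S").matches());

        // Range case from the bug report: reported as false before this PR,
        // so the printed value depends on the JDK build.
        System.out.println(
            Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches());
    }
}
```

Applying the same fold to both the pattern character and the input character is what makes a single-character construct closed under loose matching. A range has no single code point to fold, which is why it needs separate treatment.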

Xueming Shen has updated the pull request incrementally with one additional commit since the last revision:

  update to address the review comments

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/26285/files
  - new: https://git.openjdk.org/jdk/pull/26285/files/735bd722..e18d2668

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=26285&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26285&range=01-02

  Stats: 11 lines in 3 files changed: 0 ins; 4 del; 7 mod
  Patch: https://git.openjdk.org/jdk/pull/26285.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/26285/head:pull/26285

PR: https://git.openjdk.org/jdk/pull/26285