On Tue, 15 Jul 2025 15:11:07 GMT, Xueming Shen <sher...@openjdk.org> wrote:

>> Regex class should conform to **_Level 1_** of
>> [Unicode Technical Standard #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/),
>> plus RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
>>
>> This PR primarily addresses conformance with RL1.5: Simple Loose Matches,
>> which requires that simple case folding be applied to literals and
>> (optionally) to character classes. When applied to character classes, each
>> class is expected to be closed under simple case folding. See the standard
>> for a detailed explanation of what it means for a class to be “**_closed_**.”
>>
>> **RL1.5 states**:
>>
>> To meet this requirement, an implementation that supports case-sensitive
>> matching should
>>
>> 1. Provide at least the simple, default Unicode case-insensitive
>>    matching, and
>> 2. Specify which character properties or constructs are closed under the
>>    matching.
>>
>> **In the Pattern implementation**, 5 types of constructs may be affected by
>> case sensitivity:
>>
>> 1. back-refs,
>> 2. string slices (sequences),
>> 3. single characters,
>> 4. character families (Unicode Properties ...), and
>> 5. character class ranges
>>
>> **Note**: Single characters and families may appear independently or within
>> a character class.
>>
>> For case-insensitive (loose) matching, the implementation already applies
>> Character.toUpperCase() and Character.toLowerCase() to **both the pattern
>> and the input string** for back-refs, slices, and single characters. This
>> effectively makes these constructs closed under case folding.
>>
>> This has been verified in the newly added test case
>> **_test/jdk/java/util/regex/CaseFoldingTest.java_**.
>>
>> For example:
>>
>>     Pattern.compile("(?ui)\u017f").matcher("S").matches()   => true
>>     Pattern.compile("(?ui)[\u017f]").matcher("S").matches() => true
>>
>> The character properties (families) are not "closed" and should remain
>> unchanged. This is acceptable per RL1.5, if the behavior is clearly
>> specified (TBD: update javadoc to reflect this).
>>
>> **Current Non-Conformance: Character Class Ranges**, as reported in the
>> original bug report.
>>
>>     Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches() => false
>>
>> vs
>>
>>     Pattern.compile("(?ui)[S-S]").matcher("\u017f").matches() => true
>>
>> vs. Perl (Perl also claims to support Unicode's loose matching with its
>> "i" modifier):
>>
>>     perl -C -e 'print "S" =~ /[\x{017f}-\x{017f}]/ ? "true\n" : "false\n"' => false
>>     perl -C -e 'print "S" =~ /[\x{017f}-\x{0...
>
> Xueming Shen has updated the pull request incrementally with one additional
> commit since the last revision:
>
>   improve the lookup logic and test case for +00df

Thanks for the reviews!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26285#issuecomment-3074413884
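
For readers following the thread, here is a minimal sketch of the loose-match rule that the quoted PR description attributes to back-refs, slices, and single characters: both sides are passed through Character.toUpperCase() and then Character.toLowerCase() before comparison. The class name and the helpers `fold`, `looseEquals`, and `matchesUi` are hypothetical and exist only for illustration; they are not part of the PR or of java.util.regex.

```java
import java.util.regex.Pattern;

public class LooseMatchSketch {

    // Hypothetical helper: approximate simple case folding by mapping a code
    // point to upper case and then back to lower case.
    static int fold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    // Hypothetical helper: two code points match loosely if they fold to the
    // same value.
    static boolean looseEquals(int a, int b) {
        return fold(a) == fold(b);
    }

    // Hypothetical helper: compile with the (?ui) flags used in the thread
    // (CASE_INSENSITIVE plus UNICODE_CASE) and test a full match.
    static boolean matchesUi(String regex, String input) {
        return Pattern.compile("(?ui)" + regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // U+017F (LATIN SMALL LETTER LONG S) and 'S' fold to the same value.
        System.out.println(looseEquals(0x017F, 'S'));

        // The two examples the PR description reports as matching.
        System.out.println(matchesUi("\u017f", "S"));
        System.out.println(matchesUi("[\u017f]", "S"));

        // The range form is the case this PR is about; its result depends on
        // whether the running JDK includes the change, so none is assumed here.
        System.out.println(matchesUi("[\u017f-\u017f]", "S"));
    }
}
```

Running this on builds with and without the change should differ only in the last line, if the description above is accurate.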