> Regex class should conform to **_Level 1_** of
> [Unicode Technical Standard #18: Unicode Regular Expressions](http://www.unicode.org/reports/tr18/),
> plus RL2.1 Canonical Equivalents and RL2.2 Extended Grapheme Clusters.
>
> This PR primarily addresses conformance with RL1.5: Simple Loose Matches,
> which requires that simple case folding be applied to literals and
> (optionally) to character classes. When applied to character classes, each
> class is expected to **be closed under simple case folding**. See the
> standard for a detailed explanation and an example of what it means for a
> class to be "closed".
>
> **RL1.5 states**:
>
> To meet this requirement, an implementation that supports case-insensitive
> matching should
>
> 1. Provide at least the simple, default Unicode case-insensitive
>    matching, and
> 2. Specify which character properties or constructs are closed under the
>    matching.
>
> **In the Pattern implementation**, five types of constructs may be affected
> by case sensitivity:
>
> 1. back-references
> 2. string slices (sequences)
> 3. single characters
> 4. character families (Unicode Properties ...)
> 5. character class ranges
>
> **Note**: Single characters and families may appear independently or within
> a character class.
>
> For case-insensitive (loose) matching, the implementation already applies
> Character.toUpperCase() and Character.toLowerCase() to **both the pattern
> and the input string** for back-references, slices, and single characters.
> This effectively makes these constructs closed under case folding.
>
> This has been verified in the newly added test case
> **_test/jdk/java/util/regex/CaseFoldingTest.java_**.
>
> For example:
>
>     Pattern.compile("(?ui)\u017f").matcher("S").matches()   => true
>     Pattern.compile("(?ui)[\u017f]").matcher("S").matches() => true
>
> The character properties (families) are not "closed" and should remain
> unchanged. This is acceptable per RL1.5 if the behavior is clearly
> specified (TBD: update javadoc to reflect this).
>
> **Current Non-Conformance: Character Class Ranges**, as reported in the
> original bug report.
>
>     Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches() => false
>     vs
>     Pattern.compile("(?ui)[S-S]")....
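
To illustrate what "closed under simple case folding" would mean for a range, here is a minimal sketch. It is not the Pattern implementation: the class `LooseMatchSketch` and the helpers `looseEquals` and `inRangeLoose` are hypothetical names, and the comparison assumes only the `Character.toUpperCase()` / `Character.toLowerCase()` approach described in the quoted text.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the java.util.regex.Pattern implementation.
class LooseMatchSketch {

    // Compare two code points loosely by applying the upper-case and then
    // the lower-case mapping to both sides, as the description says is
    // done for literals.
    static boolean looseEquals(int a, int b) {
        if (a == b) return true;
        int ua = Character.toUpperCase(a);
        int ub = Character.toUpperCase(b);
        if (ua == ub) return true;
        return Character.toLowerCase(ua) == Character.toLowerCase(ub);
    }

    // A range [lo-hi] behaves as if closed under case folding when a
    // character matches if it is loosely equal to any member of the range.
    // A linear scan is used for clarity only; it is not efficient.
    static boolean inRangeLoose(int lo, int hi, int ch) {
        for (int c = lo; c <= hi; c++) {
            if (looseEquals(c, ch)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int longS = 0x17F; // LATIN SMALL LETTER LONG S

        System.out.println(looseEquals(longS, 'S'));         // true
        System.out.println(inRangeLoose(longS, longS, 'S')); // true
        System.out.println(inRangeLoose('S', 'S', longS));   // true

        // The range case from the description. The result depends on
        // whether the running JDK includes this change, so no expected
        // output is stated here.
        System.out.println(
            Pattern.compile("(?ui)[\u017f-\u017f]").matcher("S").matches());
    }
}
```

The scan in `inRangeLoose` only shows the expected semantics; a real implementation would presumably precompute the case-variant members of the range rather than iterate at match time.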

Xueming Shen has updated the pull request incrementally with one additional commit since the last revision:

  update to address the review comments

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/26285/files
  - new: https://git.openjdk.org/jdk/pull/26285/files/640d7a61..735bd722

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=26285&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=26285&range=00-01

  Stats: 40 lines in 2 files changed: 7 ins; 12 del; 21 mod
  Patch: https://git.openjdk.org/jdk/pull/26285.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/26285/head:pull/26285

PR: https://git.openjdk.org/jdk/pull/26285