Re: RFR 8214245 : (regex) Case insensitive matching doesn't work correctly for some character classes

Roger Riggs Tue, 25 Feb 2020 09:00:09 -0800

Hi Ivan,

The CSR and release note look fine.


Roger


On 2/24/20 8:54 PM, Ivan Gerasimov wrote:

Thank you Roger and Joe for the feedback!
May I please ask you to review the CSR draft [1] and the Release Notes [2] for this issue:
[1] https://bugs.openjdk.java.net/browse/JDK-8238984

[2] https://bugs.openjdk.java.net/browse/JDK-8239887

Thanks in advance!

Ivan


On 2/11/20 3:10 PM, Joe Darcy wrote:
Hello,
Yes, I believe this change should have a CSR, and most likely a release note.
Thanks,

-Joe

On 2/11/2020 12:49 PM, Roger Riggs wrote:
Hi Ivan,

Will this have enough of a compatibility concern to warrant a CSR?
It will change the behavior of these cases.
In the RegExTest, the failures should print which case is failing (lines 4961, 4990).
Regards, Roger


On 2/7/20 3:05 PM, Ivan Gerasimov wrote:
Gentle ping.
I had to rebase the fix, as the code has diverged since the RFR was sent out 10 months ago.
Also, the test was slightly modified to cover more cases.

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/01/webrev/

Thanks in advance to the volunteer to review the fix!

With kind regards,

Ivan

On 4/21/19 7:50 PM, Ivan Gerasimov wrote:
Hello!
It turns out that a case-insensitive j.u.regex.Pattern still pays attention to character case when certain char classes are used. For example, \p{IsLowerCase}, \p{IsUpperCase} and \p{IsTitleCase} continue to recognize only lower, upper and title case characters, even in a case-insensitive context.
For POSIX char classes, for instance, this behavior contradicts this paragraph:
"""
9.2 Regular Expression General Requirements
...
When a standard utility or function that uses regular expressions specifies that pattern matching shall be performed without regard to the case (uppercase or lowercase) of either data or patterns, then when each character in the string is matched against the pattern, not only the character, but also its case counterpart (if any), shall be matched. This definition of case-insensitive processing is intended to allow matching of multi-character collating elements as well as characters, as each character in the string is matched using both its cases.
...
"""
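The Java behavior described above can be checked with a small program along these lines. This is only a sketch: the class name CaseInsensitiveClassDemo and the helper matches() are made up for illustration and are not part of the patch or of RegExTest. No expected output is given for the case-insensitive rows, because the result depends on whether the JDK in use contains the fix.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev.
public class CaseInsensitiveClassDemo {

    // Returns true if the whole input matches the regex under the given flags.
    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;

        // Each row pairs a case-sensitive property class with an input
        // whose case does NOT match the class.
        String[][] cases = {
            {"\\p{IsLowerCase}", "A"},
            {"\\p{IsUpperCase}", "a"},
            {"\\p{IsTitleCase}", "a"},
            {"\\p{IsLowerCase}", "\u01C8"}, // title-cased digraph
            {"\\p{IsUpperCase}", "\u01C9"}, // lower-cased digraph
        };

        for (String[] c : cases) {
            // Without the flag these are expected not to match; with the
            // flag the outcome depends on the JDK version being run.
            System.out.printf("%-18s U+%04X  plain=%b  case-insensitive=%b%n",
                    c[0], c[1].codePointAt(0),
                    matches(c[0], c[1], 0),
                    matches(c[0], c[1], ci));
        }
    }
}
```

If the last column prints false for a row, that JDK shows the behavior reported here.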
I also checked how Perl deals with this situation, and yes, it ignores case for the various \p{} classes when they are used in a case-insensitive context, so all these tests run fine:
'A' =~ /\p{Lower}/i or die;
'a' =~ /\p{Upper}/i or die;
'A' =~ /\p{gc=Lt}/i or die; # title case
'a' =~ /\p{IsTitlecase}/i or die;
'ǈ' =~ /\p{Lower}/i or die; # title-cased digraph
'ǉ' =~ /\p{Upper}/i or die;
'Ǉ' =~ /\p{Lt}/i or die;
For reference, here's a lengthy document describing the precise rules used by Perl to deal with \p{} char classes:
https://perldoc.perl.org/perluniprops.html#Properties-accessible-through-%5cp%7b%7d-and-%5cP%7b%7d
So, for any Lower, Upper or Title case chars in a case-insensitive context, Perl uses the set of "Cased Letters", which is just a combination of these three categories (aka the "LC" general category).
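For comparison, j.u.regex.Pattern already accepts the "LC" general category by name, so the set of cased letters can be matched explicitly without relying on the case-insensitive flag. A minimal sketch follows; the class CasedLetterDemo and the method isCasedLetter() are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev.
public class CasedLetterDemo {

    // \p{LC} is the union of the Lu, Ll and Lt general categories.
    private static final Pattern CASED_LETTER = Pattern.compile("\\p{LC}");

    // Returns true if the input is exactly one cased letter.
    static boolean isCasedLetter(String s) {
        return CASED_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCasedLetter("A"));      // upper case letter
        System.out.println(isCasedLetter("a"));      // lower case letter
        System.out.println(isCasedLetter("\u01C8")); // title case digraph
        System.out.println(isCasedLetter("1"));      // digit, not a letter
    }
}
```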
Would you please help review the patch?

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8214245
WEBREV: http://cr.openjdk.java.net/~igerasim/8214245/00/webrev/
