Re: RFR: JDK-6609854: Regex does not match correctly for negative nested character classes

Xueming Shen Tue, 25 Feb 2014 08:08:54 -0800

A couple of rounds of email were exchanged internally regarding this issue years
ago, and a proposal was briefly discussed on this list back in 2011 [1].
The only reason that held us back from just fixing it is the "compatibility" concern,
since the described behavior has been there for a decade.

It appears we can try again, as it seems we are getting more support :-)

Thanks!
-Sherman

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html


On 2/25/14 1:30 AM, Hong Dai Thanh wrote:

*Pretext*
The reference implementation behaves strangely when nested character classes are involved.
This is described briefly in the bug report: https://bugs.openjdk.java.net/browse/JDK-6609854
It was also brought up again, in more detail, on stackoverflow.com:
http://stackoverflow.com/questions/21934168/double-negation-of-regex-character-classes

*Proposed patch*
I have looked at the current implementation and I'd like to propose the following patch.
The patch is *not yet tested against the test cases in the test package*.
However, *the structure has been verified via reflection*.

/(Commented lines are lines to be removed. Lines with a trailing comment are those to be added.)/
In method private CharProperty clazz(boolean consume):

                case ']':
                    firstInClass = false;
                    if (prev != null) {
                        if (consume)
                            next();
                        if (!include) { //
                            prev = prev.complement(); //
                        } //
                        return prev;
                    }
                    break;
                default:
                    firstInClass = false;
                    break;
            }
            node = range(bits);
//            if (include) {
                if (prev == null) {
                    prev = node;
                } else {
                    if (prev != node)
                        prev = union(prev, node);
                }
//            } else {
//                if (prev == null) {
//                    prev = node.complement();
//                } else {
//                    if (prev != node)
//                        prev = setDifference(prev, node);
//                }
//            }
            ch = peek();

*Fix Explanation*
In the current code, a nested character class that generates a single Node (for example, [^[^5-6]], [^[^c]], [^[^456]]) will never reach the check if (include) outside the switch statement. As a result, the negation is not applied to the nested character class.
It also causes redundancy with patterns such as [^45[^67]] or [^4589[^67]], since the first character 4 triggers complement() on the BitClass bits and every subsequent character (5, 8, 9) triggers setDifference() on the complement of the BitClass bits and the same BitClass bits.
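One way to observe the behavior described above on a particular JDK is a small probe like the following. This is only a sketch: the class name NestedNegationProbe and the helper matchedChars are made up for illustration, and the output is deliberately not shown because it may differ between JDK releases.

```java
import java.util.regex.Pattern;

public class NestedNegationProbe {
    // Hypothetical helper: returns the characters of `candidates` that the
    // given regex matches as a complete one-character input.
    static String matchedChars(String regex, String candidates) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c : candidates.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String candidates = "456789c";
        // Patterns taken from the explanation above.
        String[] regexes = {"[^[^c]]", "[^[^456]]", "[^45[^67]]", "[^4589[^67]]"};
        for (String regex : regexes) {
            System.out.println(regex + " matches: " + matchedChars(regex, candidates));
        }
    }
}
```

Comparing the printed sets with what plain set algebra would predict shows whether the outer negation is being applied to the nested class on that JDK.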
*Current structure*

[^45[^67]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
        [U+0034][U+0035]
        45
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
      [U+0034][U+0035]
      45
  Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
        [U+0036][U+0037]
        67
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
      [U+0036][U+0037]
      67
LastNode
Node. Accept match

*Structure after patch*

[^45[^67]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.union (character class union). Match any character matched by either character classes below:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
      [U+0034][U+0035]
      45
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
        [U+0036][U+0037]
        67
LastNode
Node. Accept match
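
To make the difference between the two structures for [^45[^67]] concrete, here is a minimal model that treats each node as a predicate on code points. It does not use java.util.regex internals: bits, complement, union and setDifference are hypothetical stand-ins for the nodes named in the dumps, assumed to behave as plain set operations.

```java
import java.util.function.IntPredicate;

public class NestedClassModel {
    // Stand-in for BitClass: matches any character contained in `chars`.
    static IntPredicate bits(String chars) {
        return cp -> chars.indexOf(cp) >= 0;
    }

    // Stand-ins for the combinators, assumed to be ordinary set operations.
    static IntPredicate complement(IntPredicate p) { return p.negate(); }
    static IntPredicate union(IntPredicate a, IntPredicate b) { return a.or(b); }
    static IntPredicate setDifference(IntPredicate a, IntPredicate b) { return a.and(b.negate()); }

    // Tree from "Current structure":
    // union(setDifference(complement(45), 45), setDifference(complement(67), 67))
    static IntPredicate current() {
        IntPredicate a = bits("45");
        IntPredicate b = bits("67");
        return union(setDifference(complement(a), a),
                     setDifference(complement(b), b));
    }

    // Tree from "Structure after patch": complement(union(45, complement(67)))
    static IntPredicate patched() {
        IntPredicate a = bits("45");
        IntPredicate b = bits("67");
        return complement(union(a, complement(b)));
    }

    // Returns the characters of `candidates` accepted by the predicate.
    static String matched(IntPredicate p, String candidates) {
        StringBuilder sb = new StringBuilder();
        candidates.chars().filter(p).forEach(c -> sb.append((char) c));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("current: " + matched(current(), "456789x")); // current: 456789x
        System.out.println("patched: " + matched(patched(), "456789x")); // patched: 67
    }
}
```

Under this model the current tree accepts every candidate, because each setDifference branch reduces to the complement of its own BitClass and the union of the two complements covers everything. The patched tree accepts only 6 and 7, which is what negating the union of [45] and [^67] would give.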
