Couple rounds of emails had been exchanged internally regarding this issue years
ago and a proposal had been briefly discussed at this list back to 2011 [1].

The only reason that hold us to just fix it is the "compatibility" concern, since the
described behavior has been there for decade.

It appears we can try it again, as if we are getting more support:-)

Thanks!
-Sherman

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html

On 2/25/14 1:30 AM, Hong Dai Thanh wrote:
*Pretext*

The reference implementation has strange behavior when there are nested character classes involved.

This is described briefly in the bug report: https://bugs.openjdk.java.net/browse/JDK-6609854

And also brought up again with more details on stackoverflow.com <http://stackoverflow.com>

http://stackoverflow.com/questions/21934168/double-negation-of-regex-character-classes

*Proposed patch*

I have looked at the current implementation and I'd like to propose the following patch.

The patch is *not yet tested against the test cases in test package*.

However, *the structure has been verified via reflection*.

/(Commented lines are lines to be removed. Lines with trailing comment are those to be added.)/

In method private CharProperty clazz(boolean consume)

                case ']':
                    firstInClass = false;
                    if (prev != null) {
                        if (consume)
                            next();
                        if (!include) { //
                            prev = prev.complement(); //
                        } //
                        return prev;
                    }
                    break;
                default:
                    firstInClass = false;
                    break;
            }
            node = range(bits);
//            if (include) {
                if (prev == null) {
                    prev = node;
                } else {
                    if (prev != node)
                        prev = union(prev, node);
                }
//            } else {
//                if (prev == null) {
//                    prev = node.complement();
//                } else {
//                    if (prev != node)
//                        prev = setDifference(prev, node);
//                }
//            }
            ch = peek();

*Fix Explanation*

In the current code, a nested character class that generates a single Node (for example, [^[^5-6]], [^[^c]], [^[^456]]) will never reach the check if (include) outside the switch statement. As a result, the negation is not applied to the nested character class.

It also causes redundancy with patterns such as [^45[^67]] or [^4589[^67]], since the first character 4 triggers complement() on the BitClass bits and every subsequence character (5, 8, 9) triggers setDifference() on the BitClass bits and the complement of the same BitClass bits.

*Current structure*

[^45[^67]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below: Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
        [U+0034][U+0035]
        45
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
      [U+0034][U+0035]
      45
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class: CharProperty.complement (character class negation). Match any character NOT matched by the following character class: BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
        [U+0036][U+0037]
        67
BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
      [U+0036][U+0037]
      67
LastNode
Node. Accept match

*Structure after patch*

[^45[^67]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class: Pattern.union (character class union). Match any character matched by either character classes below: BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
      [U+0034][U+0035]
      45
CharProperty.complement (character class negation). Match any character NOT matched by the following character class: BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
        [U+0036][U+0037]
        67
LastNode
Node. Accept match

Reply via email to