Couple rounds of emails had been exchanged internally regarding this
issue years
ago and a proposal had been briefly discussed at this list back to 2011 [1].
The only reason that hold us to just fix it is the "compatibility"
concern, since the
described behavior has been there for decade.
It appears we can try it again, as if we are getting more support:-)
Thanks!
-Sherman
[1]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html
On 2/25/14 1:30 AM, Hong Dai Thanh wrote:
*Pretext*
The reference implementation has strange behavior when there are
nested character classes involved.
This is described briefly in the bug report:
https://bugs.openjdk.java.net/browse/JDK-6609854
And also brought up again with more details on stackoverflow.com
<http://stackoverflow.com>
http://stackoverflow.com/questions/21934168/double-negation-of-regex-character-classes
*Proposed patch*
I have looked at the current implementation and I'd like to propose
the following patch.
The patch is *not yet tested against the test cases in test package*.
However, *the structure has been verified via reflection*.
/(Commented lines are lines to be removed. Lines with trailing comment
are those to be added.)/
In method private CharProperty clazz(boolean consume)
case ']':
firstInClass = false;
if (prev != null) {
if (consume)
next();
if (!include) { //
prev = prev.complement(); //
} //
return prev;
}
break;
default:
firstInClass = false;
break;
}
node = range(bits);
// if (include) {
if (prev == null) {
prev = node;
} else {
if (prev != node)
prev = union(prev, node);
}
// } else {
// if (prev == null) {
// prev = node.complement();
// } else {
// if (prev != node)
// prev = setDifference(prev, node);
// }
// }
ch = peek();
*Fix Explanation*
In the current code, a nested character class that generates a single
Node (for example, [^[^5-6]], [^[^c]], [^[^456]]) will never reach the
check if (include) outside the switch statement. As a result, the
negation is not applied to the nested character class.
It also causes redundancy with patterns such as [^45[^67]] or
[^4589[^67]], since the first character 4 triggers complement() on the
BitClass bits and every subsequence character (5, 8, 9) triggers
setDifference() on the BitClass bits and the complement of the same
BitClass bits.
*Current structure*
[^45[^67]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by
either character classes below:
Pattern.setDifference (character class subtraction). Match any
character matched by the 1st character class, but NOT the 2nd
character class:
CharProperty.complement (character class negation). Match any
character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
[U+0034][U+0035]
45
BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
[U+0034][U+0035]
45
Pattern.setDifference (character class subtraction). Match any
character matched by the 1st character class, but NOT the 2nd
character class:
CharProperty.complement (character class negation). Match any
character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
[U+0036][U+0037]
67
BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
[U+0036][U+0037]
67
LastNode
Node. Accept match
*Structure after patch*
[^45[^67]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any
character NOT matched by the following character class:
Pattern.union (character class union). Match any character matched
by either character classes below:
BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
[U+0034][U+0035]
45
CharProperty.complement (character class negation). Match any
character NOT matched by the following character class:
BitClass. Optimized character class with boolean[] to match
characters in Latin-1 (code point <= 255). Match the following 2
character(s):
[U+0036][U+0037]
67
LastNode
Node. Accept match