Thanks Roger,

I hadn't seen those existing bugs, so following that trail I get to:
  https://bugs.openjdk.java.net/browse/JDK-8189343
  JDK-8189343: Change of behavior of java.util.regex.Pattern between JDK 8 and JDK 9

which was resolved as "Not an issue", as the related fix to 6609854 you refer to changed the behaviour...

The basic discussion point, which can be found here:
  http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-June/006957.html
basically points out that the negate operator "^" has the "lowest" precedence of the operations by design/spec, so for example:
  [^a-b&&b-d]
is NOT logically
  [^a-b]&&[b-d]
but is in fact
  ^[[a-b]&&[b-d]]
i.e. the negation of the whole intersection, as ^ takes the least precedence compared to &&.

So the new bug JDK-8215626 is in fact working as now designed. I will update it and close it appropriately.

Many thanks
Andrew

Andrew Leonard
Java Runtimes Development
IBM Hursley
IBM United Kingdom Ltd
Phone internal: 245913, external: 01962 815913
internet email: andrew_m_leon...@uk.ibm.com

From: Roger Riggs <roger.ri...@oracle.com>
To: core-libs-dev@openjdk.java.net
Date: 07/01/2019 15:33
Subject: Re: JDK-8215626 : Correct [^..&&..] intersection negation behaviour JDK8 vs JDK11 ??
Sent by: "core-libs-dev" <core-libs-dev-boun...@openjdk.java.net>

Hi Andrew,

Did your investigation lead you to:
  6609854: Regex does not match correctly for negative nested character classes

That might explain when the behavior changed and perhaps why.

$.02, Roger

On 01/03/2019 06:20 AM, Andrew Leonard wrote:
> Hi,
> I'm currently investigating bug JDK-8215626 and have discovered the
> problem is in the Pattern interpretation of the [^..&&..] negation when
> applied to "intersected" expressions. So I have simplified the bug example
> to a more extreme and obvious example:
>   Input string: "1234 ABCDEFG !$%^& abcdefg"
>   pattern RegEx: "[^[A-B]&&[^ef]]"
>   Operation: pattern.matcher(input).replaceAll("");
>
> JDK8 output:
>   1234 CDEFG !$%^& abcdefg
> JDK11 output:
>   AB
>
> So from the "spec" :
> A character class is a set of characters enclosed within square brackets.
> It specifies the characters that will successfully match a single
> character from a given input string
> Intersection:
> To create a single character class matching only the characters common to
> all of its nested classes, use &&, as in [0-9&&[345]].
> Negation:
> To match all characters except those listed, insert the "^" metacharacter
> at the beginning of the character class.
>
> The way I read the "spec" is the "^" negation negates the whole character
> class within the outer square brackets, thus in this example:
>   "[^[A-B]&&[^ef]]" is equivalent to the negation of "[[A-B]&&[^ef]]"
>   ie. the negation of the intersect of chars A,B and everything other
>   than e,f
>   which is thus the negation of A,B
>   hence the operation above will remove any character in the input
>   string other than A,B
> Hence, JDK11 in my opinion meets the "spec". It looks as though JDK8 is
> applying the ^ negation to just [A-B] and then intersecting it with [^ef],
> which to me is the wrong interpretation of the "spec".
>
> Your thoughts please?
>
> If JDK11 is correct, and JDK8 wrong, then the next question is do we fix
> JDK8? as there's obviously potential "behavioural" impacts to existing
> applications....?
>
> Thanks
> Andrew
>
> Andrew Leonard
> Java Runtimes Development
> IBM Hursley
> IBM United Kingdom Ltd
> Phone internal: 245913, external: 01962 815913
> internet email: andrew_m_leon...@uk.ibm.com
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
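
For anyone wanting to reproduce the behaviour discussed in this thread, a minimal sketch follows. It uses only java.util.regex; the class name NegatedIntersectionDemo and the helper strip() are made up for illustration and are not taken from either bug report. The two control patterns should behave the same on every JDK, while the result for the contested pattern depends on the runtime (the thread reports "AB" on JDK11 and "1234 CDEFG !$%^& abcdefg" on JDK8), so the sketch prints it together with the Java version rather than assuming an answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or of the bug reports.
final class NegatedIntersectionDemo {

    // Removes every character of input that matches the given character class.
    static String strip(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "1234 ABCDEFG !$%^& abcdefg";

        // Control 1: the intersection without the leading ^ matches only A and B,
        // so only those two characters are removed.
        System.out.println(strip("[[A-B]&&[^ef]]", input));

        // Control 2: an explicit negation of A-B matches everything except A and B,
        // so only A and B survive.
        System.out.println(strip("[^A-B]", input));

        // Contested pattern from the thread: the result depends on whether the
        // runtime negates the whole intersection or only the first operand.
        System.out.println(System.getProperty("java.version") + ": "
                + strip("[^[A-B]&&[^ef]]", input));
    }
}
```

If the contested line prints the same result as control 2, the runtime is negating the whole intersection, which is the reading of the spec argued for above.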