[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

  What             |Removed |Added
  -----------------+--------+------
  Target Milestone |11.4    |10.5

--- Comment #14 from Jonathan Wakely ---
Backported for 10.5 too.
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #13 from CVS Commits ---
The releases/gcc-10 branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:4c347b8d59958d5aa76c5fdcecd72478e08c5aa3

commit r10-11465-g4c347b8d59958d5aa76c5fdcecd72478e08c5aa3
Author: Jonathan Wakely
Date: Tue Dec 14 14:32:35 2021 +

    libstdc++: Fix handling of invalid ranges in std::regex [PR102447]

    std::regex currently allows invalid bracket ranges such as [\w-a] which
    are only allowed by ECMAScript when in web browser compatibility mode.
    It should be an error, because the start of the range is a character
    class, not a single character. The current implementation of
    _Compiler::_M_expression_term does not provide a way to reject this,
    because we only remember a previous character, not whether we just
    processed a character class (or collating symbol etc.)

    This patch replaces the pair used to emulate optional with a custom
    class closer to pair. That allows us to track three states, so that we
    can tell when we've just seen a character class.

    With this additional state the code in _M_expression_term for
    processing the _S_token_bracket_dash can be improved to correctly
    reject the [\w-a] case, without regressing for valid cases such as
    [\w-] and [].

    libstdc++-v3/ChangeLog:

            PR libstdc++/102447
            * include/bits/regex_compiler.h (_Compiler::_BracketState): New
            class.
            (_Compiler::_BracketMatcher): New alias template.
            (_Compiler::_M_expression_term): Change pair parameter to
            _BracketState. Process first character for ECMAScript syntax as
            well as POSIX.
            * include/bits/regex_compiler.tcc
            (_Compiler::_M_insert_bracket_matcher): Pass _BracketState.
            (_Compiler::_M_expression_term): Use _BracketState to store
            state between calls. Improve handling of dashes in ranges.
            * testsuite/28_regex/algorithms/regex_match/cstring_bracket_01.cc:
            Add more tests for ranges containing dashes. Check invalid
            ranges with character class at the beginning.
(cherry picked from commit 7ce3c230edf6e498e125c805a6dd313bf87dc439)
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

  What             |Removed  |Added
  -----------------+---------+---------
  Status           |ASSIGNED |RESOLVED
  Resolution       |---      |FIXED
  Target Milestone |---      |11.4

--- Comment #12 from Jonathan Wakely ---
Fixed for 11.4 too.
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #11 from CVS Commits ---
The releases/gcc-11 branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:c725028a8bb9478ec84332641147ad12b9236922

commit r11-10130-gc725028a8bb9478ec84332641147ad12b9236922
Author: Jonathan Wakely
Date: Tue Dec 14 14:32:35 2021 +

    libstdc++: Fix handling of invalid ranges in std::regex [PR102447]

    std::regex currently allows invalid bracket ranges such as [\w-a] which
    are only allowed by ECMAScript when in web browser compatibility mode.
    It should be an error, because the start of the range is a character
    class, not a single character. The current implementation of
    _Compiler::_M_expression_term does not provide a way to reject this,
    because we only remember a previous character, not whether we just
    processed a character class (or collating symbol etc.)

    This patch replaces the pair used to emulate optional with a custom
    class closer to pair. That allows us to track three states, so that we
    can tell when we've just seen a character class.

    With this additional state the code in _M_expression_term for
    processing the _S_token_bracket_dash can be improved to correctly
    reject the [\w-a] case, without regressing for valid cases such as
    [\w-] and [].

    libstdc++-v3/ChangeLog:

            PR libstdc++/102447
            * include/bits/regex_compiler.h (_Compiler::_BracketState): New
            class.
            (_Compiler::_BracketMatcher): New alias template.
            (_Compiler::_M_expression_term): Change pair parameter to
            _BracketState. Process first character for ECMAScript syntax as
            well as POSIX.
            * include/bits/regex_compiler.tcc
            (_Compiler::_M_insert_bracket_matcher): Pass _BracketState.
            (_Compiler::_M_expression_term): Use _BracketState to store
            state between calls. Improve handling of dashes in ranges.
            * testsuite/28_regex/algorithms/regex_match/cstring_bracket_01.cc:
            Add more tests for ranges containing dashes. Check invalid
            ranges with character class at the beginning.
(cherry picked from commit 7ce3c230edf6e498e125c805a6dd313bf87dc439)
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #10 from Jonathan Wakely ---
The std::regex{"[\\w-a]"} case will throw a std::regex_error exception now.

I'd like to backport this, but I'm going to wait a while. I am not entirely
confident that my changes won't cause regressions elsewhere in the bracket
handling.
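
One way to observe the behaviour described in comment #10 is to wrap the std::regex constructor in a try block. The sketch below is illustrative only: the helper name `compiles` is hypothetical and not part of libstdc++, and whether "[\\w-a]" is accepted depends on which libstdc++ release (or other standard library) the program is built against.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of libstdc++): reports whether `pattern`
// can be compiled by std::regex under the given syntax flags.
bool compiles(const std::string& pattern,
              std::regex::flag_type flags = std::regex::ECMAScript)
{
    try {
        std::regex re(pattern, flags);   // throws std::regex_error if invalid
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

With a libstdc++ that contains the fix, `compiles("[\\w-a]")` should return false; with an older one it is expected to return true. Out-of-order ranges such as `compiles("[z-a]")` are rejected either way.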
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #9 from CVS Commits ---
The master branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:7ce3c230edf6e498e125c805a6dd313bf87dc439

commit r12-5977-g7ce3c230edf6e498e125c805a6dd313bf87dc439
Author: Jonathan Wakely
Date: Tue Dec 14 14:32:35 2021 +

    libstdc++: Fix handling of invalid ranges in std::regex [PR102447]

    std::regex currently allows invalid bracket ranges such as [\w-a] which
    are only allowed by ECMAScript when in web browser compatibility mode.
    It should be an error, because the start of the range is a character
    class, not a single character. The current implementation of
    _Compiler::_M_expression_term does not provide a way to reject this,
    because we only remember a previous character, not whether we just
    processed a character class (or collating symbol etc.)

    This patch replaces the pair used to emulate optional with a custom
    class closer to pair. That allows us to track three states, so that we
    can tell when we've just seen a character class.

    With this additional state the code in _M_expression_term for
    processing the _S_token_bracket_dash can be improved to correctly
    reject the [\w-a] case, without regressing for valid cases such as
    [\w-] and [].

    libstdc++-v3/ChangeLog:

            PR libstdc++/102447
            * include/bits/regex_compiler.h (_Compiler::_BracketState): New
            class.
            (_Compiler::_BracketMatcher): New alias template.
            (_Compiler::_M_expression_term): Change pair parameter to
            _BracketState. Process first character for ECMAScript syntax as
            well as POSIX.
            * include/bits/regex_compiler.tcc
            (_Compiler::_M_insert_bracket_matcher): Pass _BracketState.
            (_Compiler::_M_expression_term): Use _BracketState to store
            state between calls. Improve handling of dashes in ranges.
            * testsuite/28_regex/algorithms/regex_match/cstring_bracket_01.cc:
            Add more tests for ranges containing dashes. Check invalid
            ranges with character class at the beginning.
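
The three-state idea from the commit message (nothing seen, a single character seen, a character class seen) can be illustrated with a small standalone model. This is a simplified sketch and not the libstdc++ code: the names `Last`, `Atom`, `read_atom` and `valid_bracket_body` are hypothetical, only \w, \d, \s and their uppercase forms are treated as character classes, and collating symbols and POSIX classes are ignored.

```cpp
#include <cstddef>
#include <string>

// Hypothetical model: what the previous element of the bracket expression was.
enum class Last { None, Char, Class };

struct Atom {
    bool is_class;  // true for \w, \d, \s, \W, \D, \S
    char ch;        // the character (or the class letter)
};

// Reads one atom starting at body[i] and advances i past it.
Atom read_atom(const std::string& body, std::size_t& i)
{
    char c = body[i++];
    if (c == '\\' && i < body.size()) {
        char e = body[i++];
        switch (e) {
        case 'w': case 'W': case 'd': case 'D': case 's': case 'S':
            return {true, e};
        default:
            return {false, e};  // any other escape is a literal here
        }
    }
    return {false, c};
}

// Validates the text between '[' and ']' under the strict (non-browser) rules.
bool valid_bracket_body(const std::string& body)
{
    Last last = Last::None;
    char last_ch = '\0';
    std::size_t i = 0;
    while (i < body.size()) {
        const bool dash = body[i] == '-';
        const bool at_end = i + 1 == body.size();
        if (dash && !at_end && last != Last::None) {
            if (last == Last::Class)
                return false;              // range starts with a class: \w-a
            ++i;                           // consume the '-'
            const Atom end = read_atom(body, i);
            if (end.is_class)
                return false;              // range ends with a class: a-\w
            if (static_cast<unsigned char>(last_ch)
                > static_cast<unsigned char>(end.ch))
                return false;              // out-of-order range: z-a
            last = Last::None;             // range complete
            continue;
        }
        // A leading or trailing '-' falls through and is read as a literal.
        const Atom a = read_atom(body, i);
        last = a.is_class ? Last::Class : Last::Char;
        last_ch = a.ch;
    }
    return true;
}
```

In this model `valid_bracket_body("\\w-a")` is rejected because the state before the dash is `Last::Class`, while `valid_bracket_body("\\w-")` is accepted because the trailing dash is read as a literal. With only a "have a previous character or not" flag, those two cases could not be told apart from a missing previous character.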
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

  What   |Removed |Added
  -------+--------+---------
  Status |NEW     |ASSIGNED
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Ikarashi changed:

  What |Removed |Added
  -----+--------+------------------------------
  CC   |        |s.ikarashi at fujitsu dot com

--- Comment #8 from Ikarashi ---
> The ClassAtom \w does not contain exactly one character, so I think it's a
> syntax error.

If you process '\w', '-', and 'a' in this order, can the first \w be a
ClassAtom at all? According to the definition of Atom, it seems to be counted
as a "\ AtomEscape" before the beginning of a CharacterClass.

  Atom[U, N] ::
      PatternCharacter
      .
      \ AtomEscape[?U, ?N]
      CharacterClass[?U]
      ( GroupSpecifier[?U] Disjunction[?U, ?N] )
      ( ? : Disjunction[?U, ?N] )

A \w is a "\ CharacterClassEscape", so it can be a "\ AtomEscape". I know it
can also be a "\ ClassEscape" and a ClassAtomNoDash; however, \w-a looks like
two Atoms to me, one Atom \w and one Atom -a.

Is there any rule defining the order of such interpretations?
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #7 from TC ---
(In reply to Jonathan Wakely from comment #6)
> I have looked in detail (I have the 3rd, 4th and 5th editions here) but my
> brain started oozing out of my ears.
>
> 15.10.2.15 NonemptyClassRanges and 15.10.2.16 NonemptyClassRangesNoDash are
> the relevant sections of the 1999 3rd edition. The former defines:
>
>   The internal helper function CharacterRange takes two CharSet parameters
>   A and B and performs the following:
>   1. If A does not contain exactly one character or B does not contain
>      exactly one character then throw a SyntaxError exception.
>
> And the latter has this note:
>
>   Informative comments: ClassRanges can expand into single ClassAtoms and/or
>   ranges of two ClassAtoms separated by dashes. In the latter case the
>   ClassRanges includes all characters between the first ClassAtom and the
>   second ClassAtom, inclusive; an error occurs if either ClassAtom does not
>   represent a single character (for example, if one is \w) or if the first
>   ClassAtom's code point value is greater than the second ClassAtom's code
>   point value.
>
> The ClassAtom \w does not contain exactly one character, so I think it's a
> syntax error.
>
> The 3rd edition doesn't mention any legacy features of RegExp, but it does
> seem to require the strict behaviour.

I've looked at the 1999 spec now, and agree with your reading.
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #6 from Jonathan Wakely ---
I have looked in detail (I have the 3rd, 4th and 5th editions here) but my
brain started oozing out of my ears.

15.10.2.15 NonemptyClassRanges and 15.10.2.16 NonemptyClassRangesNoDash are
the relevant sections of the 1999 3rd edition. The former defines:

  The internal helper function CharacterRange takes two CharSet parameters
  A and B and performs the following:
  1. If A does not contain exactly one character or B does not contain
     exactly one character then throw a SyntaxError exception.

And the latter has this note:

  Informative comments: ClassRanges can expand into single ClassAtoms and/or
  ranges of two ClassAtoms separated by dashes. In the latter case the
  ClassRanges includes all characters between the first ClassAtom and the
  second ClassAtom, inclusive; an error occurs if either ClassAtom does not
  represent a single character (for example, if one is \w) or if the first
  ClassAtom's code point value is greater than the second ClassAtom's code
  point value.

The ClassAtom \w does not contain exactly one character, so I think it's a
syntax error.

The 3rd edition doesn't mention any legacy features of RegExp, but it does
seem to require the strict behaviour.
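
The CharacterRange helper quoted above can be sketched in C++ as follows. This is an illustration of the quoted rule and the informative note, not code from any regex implementation: the `CharSet` alias, the `SyntaxError` type and the function name `character_range` are all stand-ins chosen here, and characters are modelled as plain `char` rather than ECMAScript code units.

```cpp
#include <set>
#include <stdexcept>

// Stand-in for the specification's CharSet (assumption: plain chars).
using CharSet = std::set<char>;

// Stand-in for ECMAScript's SyntaxError exception.
struct SyntaxError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Builds the set of characters from the single member of `a` up to the
// single member of `b`, inclusive.
CharSet character_range(const CharSet& a, const CharSet& b)
{
    // Step 1 as quoted: both operands must contain exactly one character.
    if (a.size() != 1 || b.size() != 1)
        throw SyntaxError("range endpoint is not a single character");

    const unsigned char lo = static_cast<unsigned char>(*a.begin());
    const unsigned char hi = static_cast<unsigned char>(*b.begin());

    // Per the informative note: the first endpoint must not exceed the second.
    if (lo > hi)
        throw SyntaxError("range endpoints are out of order");

    CharSet result;
    for (unsigned int c = lo; c <= hi; ++c)   // wider type avoids wrap at 255
        result.insert(static_cast<char>(c));
    return result;
}
```

Under this model the CharSet for \w holds many characters, so passing it as either operand trips the first check, which is the reading of [\w-a] argued for in this comment.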
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

TC changed:

  What |Removed |Added
  -----+--------+------------------------
  CC   |        |rs2740 at gmail dot com

--- Comment #5 from TC ---
Hmm, but C++'s normative reference is to a 1999 version of ECMAScript...which
might well have the "legacy" behavior? (I haven't looked at it in detail.)
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

  What       |Removed  |Added
  -----------+---------+------
  Resolution |INVALID  |---
  Status     |RESOLVED |NEW

--- Comment #4 from Jonathan Wakely ---
Reopening.

JavaScript engines in web browsers accept invalid regexes for legacy support:
https://262.ecma-international.org/#sec-additional-ecmascript-features-for-web-browsers

If we're not implementing a browser engine, then it should be a syntax error:

  NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges

    It is a Syntax Error if IsCharacterClass of the first ClassAtom is true
    or IsCharacterClass of the second ClassAtom is true.

    It is a Syntax Error if IsCharacterClass of the first ClassAtom is false
    and IsCharacterClass of the second ClassAtom is false and the
    CharacterValue of the first ClassAtom is larger than the CharacterValue
    of the second ClassAtom.
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

  What       |Removed  |Added
  -----------+---------+---------
  Status     |ASSIGNED |RESOLVED
  Resolution |---      |INVALID

--- Comment #3 from Jonathan Wakely ---
Not a bug, this is the expected behaviour for ECMAScript regular expressions.

Using one of the POSIX syntax options will cause regex_error to be thrown.
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #2 from Jonathan Wakely ---
Actually, this might not be a bug. We have this comment in regex_compiler.tcc

  // POSIX doesn't allow '-' as a start-range char (say [a-z--0]),
  // except when the '-' is the first or last character in the bracket
  // expression ([--0]). ECMAScript treats all '-' after a range as a
  // normal character. Also see above, where _M_expression_term gets called.
  //
  // As a result, POSIX rejects [-], but ECMAScript doesn't.
  // Boost (1.57.0) always uses POSIX style even in its ECMAScript syntax.
  // Clang (3.5) always uses ECMAScript style even in its POSIX syntax.
  //
  // It turns out that no one reads BNFs ;)

So [\w-a] is valid for the ECMAScript syntax and is equivalent to POSIX
[-_[:alnum:]].

You can confirm this using your browser's javascript console, where this will
print true:

  RegExp('[\\w-a]').test('-')
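
The POSIX bracket expression given in comment #2 as the equivalent of [\w-a] can be tried directly with std::regex. The snippet below is a minimal sketch: the function name `matches_word_or_dash` is hypothetical, and std::regex::extended is picked here as one of the POSIX syntax options.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the whole of `s` is matched by the POSIX
// bracket expression [-_[:alnum:]] (a dash, an underscore, or an
// alphanumeric character).
bool matches_word_or_dash(const std::string& s)
{
    static const std::regex re("[-_[:alnum:]]", std::regex::extended);
    return std::regex_match(s, re);
}
```

Because the dash is the first character inside the brackets, it is taken as a literal, so `matches_word_or_dash("-")` is expected to succeed just as the JavaScript test above does for the lenient browser interpretation.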
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

  What     |Removed                       |Added
  ---------+------------------------------+-----------------------
  Status   |NEW                           |ASSIGNED
  Assignee |unassigned at gcc dot gnu.org |redi at gcc dot gnu.org
[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Marek Polacek changed:

  What             |Removed     |Added
  -----------------+------------+---------------------------
  CC               |            |mpolacek at gcc dot gnu.org
  Status           |UNCONFIRMED |NEW
  Ever confirmed   |0           |1
  Last reconfirmed |            |2021-09-24

--- Comment #1 from Marek Polacek ---
Confirmed.