[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2023-06-23 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|11.4                        |10.5

--- Comment #14 from Jonathan Wakely ---
Backported for 10.5 too.

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2023-06-23 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #13 from CVS Commits ---
The releases/gcc-10 branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:4c347b8d59958d5aa76c5fdcecd72478e08c5aa3

commit r10-11465-g4c347b8d59958d5aa76c5fdcecd72478e08c5aa3
Author: Jonathan Wakely 
Date:   Tue Dec 14 14:32:35 2021 +

libstdc++: Fix handling of invalid ranges in std::regex [PR102447]

std::regex currently allows invalid bracket ranges such as [\w-a] which
are only allowed by ECMAScript when in web browser compatibility mode.
It should be an error, because the start of the range is a character
class, not a single character. The current implementation of
_Compiler::_M_expression_term does not provide a way to reject this,
because we only remember a previous character, not whether we just
processed a character class (or collating symbol etc.)

This patch replaces the pair<bool, _CharT> used to emulate
optional<_CharT> with a custom class closer to pair<tribool, _CharT>. That
allows us to track three states, so that we can tell when we've just
seen a character class.

With this additional state the code in _M_expression_term for processing
the _S_token_bracket_dash can be improved to correctly reject the [\w-a]
case, without regressing for valid cases such as [\w-] and [].

libstdc++-v3/ChangeLog:

PR libstdc++/102447
* include/bits/regex_compiler.h (_Compiler::_BracketState): New
class.
(_Compiler::_BracketMatcher): New alias template.
(_Compiler::_M_expression_term): Change pair<bool, _CharT>
parameter to _BracketState. Process first character for
ECMAScript syntax as well as POSIX.
* include/bits/regex_compiler.tcc
(_Compiler::_M_insert_bracket_matcher): Pass _BracketState.
(_Compiler::_M_expression_term): Use _BracketState to store
state between calls. Improve handling of dashes in ranges.
* testsuite/28_regex/algorithms/regex_match/cstring_bracket_01.cc:
Add more tests for ranges containing dashes. Check invalid
ranges with character class at the beginning.

(cherry picked from commit 7ce3c230edf6e498e125c805a6dd313bf87dc439)

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2022-07-07 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED
   Target Milestone|---                         |11.4

--- Comment #12 from Jonathan Wakely ---
Fixed for 11.4 too.

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2022-07-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #11 from CVS Commits ---
The releases/gcc-11 branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:c725028a8bb9478ec84332641147ad12b9236922

commit r11-10130-gc725028a8bb9478ec84332641147ad12b9236922
Author: Jonathan Wakely 
Date:   Tue Dec 14 14:32:35 2021 +

libstdc++: Fix handling of invalid ranges in std::regex [PR102447]

std::regex currently allows invalid bracket ranges such as [\w-a] which
are only allowed by ECMAScript when in web browser compatibility mode.
It should be an error, because the start of the range is a character
class, not a single character. The current implementation of
_Compiler::_M_expression_term does not provide a way to reject this,
because we only remember a previous character, not whether we just
processed a character class (or collating symbol etc.)

This patch replaces the pair<bool, _CharT> used to emulate
optional<_CharT> with a custom class closer to pair<tribool, _CharT>. That
allows us to track three states, so that we can tell when we've just
seen a character class.

With this additional state the code in _M_expression_term for processing
the _S_token_bracket_dash can be improved to correctly reject the [\w-a]
case, without regressing for valid cases such as [\w-] and [].

libstdc++-v3/ChangeLog:

PR libstdc++/102447
* include/bits/regex_compiler.h (_Compiler::_BracketState): New
class.
(_Compiler::_BracketMatcher): New alias template.
(_Compiler::_M_expression_term): Change pair<bool, _CharT>
parameter to _BracketState. Process first character for
ECMAScript syntax as well as POSIX.
* include/bits/regex_compiler.tcc
(_Compiler::_M_insert_bracket_matcher): Pass _BracketState.
(_Compiler::_M_expression_term): Use _BracketState to store
state between calls. Improve handling of dashes in ranges.
* testsuite/28_regex/algorithms/regex_match/cstring_bracket_01.cc:
Add more tests for ranges containing dashes. Check invalid
ranges with character class at the beginning.

(cherry picked from commit 7ce3c230edf6e498e125c805a6dd313bf87dc439)

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-12-14 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #10 from Jonathan Wakely ---
The std::regex{"[\\w-a]"} case will throw a std::regex_error exception now. I'd
like to backport this, but I'm going to wait a while. I am not entirely
confident that my changes won't cause regressions elsewhere in the bracket
handling.
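
A minimal sketch of how a caller could observe the new behaviour. The helper name compile_error is hypothetical (it is not part of libstdc++), and the result for "[\\w-a]" depends on which libstdc++ release is in use, so no particular outcome is claimed for it here.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper (an assumption for illustration): try to compile a
// pattern and return the error code if construction throws
// std::regex_error, or an empty optional if the pattern is accepted.
std::optional<std::regex_constants::error_type>
compile_error(const std::string& pattern,
              std::regex::flag_type flags = std::regex::ECMAScript)
{
  try {
    std::regex re(pattern, flags);
    return std::nullopt;
  } catch (const std::regex_error& e) {
    return e.code();
  }
}
```

Calling compile_error("[\\w-a]") on a release containing the fix is expected to return an error code, while a release without the fix returns an empty optional.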

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-12-14 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #9 from CVS Commits ---
The master branch has been updated by Jonathan Wakely:

https://gcc.gnu.org/g:7ce3c230edf6e498e125c805a6dd313bf87dc439

commit r12-5977-g7ce3c230edf6e498e125c805a6dd313bf87dc439
Author: Jonathan Wakely 
Date:   Tue Dec 14 14:32:35 2021 +

libstdc++: Fix handling of invalid ranges in std::regex [PR102447]

std::regex currently allows invalid bracket ranges such as [\w-a] which
are only allowed by ECMAScript when in web browser compatibility mode.
It should be an error, because the start of the range is a character
class, not a single character. The current implementation of
_Compiler::_M_expression_term does not provide a way to reject this,
because we only remember a previous character, not whether we just
processed a character class (or collating symbol etc.)

This patch replaces the pair<bool, _CharT> used to emulate
optional<_CharT> with a custom class closer to pair<tribool, _CharT>. That
allows us to track three states, so that we can tell when we've just
seen a character class.

With this additional state the code in _M_expression_term for processing
the _S_token_bracket_dash can be improved to correctly reject the [\w-a]
case, without regressing for valid cases such as [\w-] and [].

libstdc++-v3/ChangeLog:

PR libstdc++/102447
* include/bits/regex_compiler.h (_Compiler::_BracketState): New
class.
(_Compiler::_BracketMatcher): New alias template.
(_Compiler::_M_expression_term): Change pair<bool, _CharT>
parameter to _BracketState. Process first character for
ECMAScript syntax as well as POSIX.
* include/bits/regex_compiler.tcc
(_Compiler::_M_insert_bracket_matcher): Pass _BracketState.
(_Compiler::_M_expression_term): Use _BracketState to store
state between calls. Improve handling of dashes in ranges.
* testsuite/28_regex/algorithms/regex_match/cstring_bracket_01.cc:
Add more tests for ranges containing dashes. Check invalid
ranges with character class at the beginning.
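
The three-state idea described in the commit message can be illustrated with a toy validator. This is a sketch under stated assumptions, not the libstdc++ implementation: BracketState and valid_bracket_body are hypothetical names, the grammar only knows literal characters, '-' and the class escapes \w, \d and \s, and the dash handling is deliberately simpler than the real _M_expression_term.

```cpp
#include <cstddef>
#include <string>

// Hypothetical three-state record, loosely inspired by the description of
// _BracketState above; names and layout are assumptions, not libstdc++ code.
struct BracketState {
  enum class Kind { none, single_char, char_class };
  Kind kind = Kind::none;
  char ch = '\0';  // meaningful only when kind == Kind::single_char
};

// Validate the text between '[' and ']' for a toy grammar: literal
// characters, '-', and the class escapes \w, \d and \s.
bool valid_bracket_body(const std::string& body)
{
  using Kind = BracketState::Kind;
  BracketState last;

  auto is_class_escape = [&body](std::size_t pos) {
    return pos + 1 < body.size() && body[pos] == '\\'
        && (body[pos + 1] == 'w' || body[pos + 1] == 'd'
            || body[pos + 1] == 's');
  };

  std::size_t i = 0;
  while (i < body.size()) {
    if (is_class_escape(i)) {
      last.kind = Kind::char_class;
      i += 2;
      continue;
    }
    // A dash followed by something else may be a range operator.
    if (body[i] == '-' && i + 1 < body.size()) {
      if (last.kind == Kind::char_class)
        return false;                      // e.g. [\w-a]
      if (last.kind == Kind::single_char) {
        if (is_class_escape(i + 1))
          return false;                    // e.g. [a-\w]
        const char end = body[i + 1];
        if (static_cast<unsigned char>(end)
            < static_cast<unsigned char>(last.ch))
          return false;                    // e.g. [z-a]
        last = BracketState{};             // range consumed, nothing pending
        i += 2;
        continue;
      }
      // Kind::none: no pending start, so the dash is an ordinary character.
    }
    last.kind = Kind::single_char;
    last.ch = body[i];
    ++i;
  }
  return true;
}
```

With only a "have a previous character or not" flag, the char_class case is indistinguishable from none, which is why [\w-a] could not be rejected before.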

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-12-13 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-10-03 Thread s.ikarashi at fujitsu dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Ikarashi changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |s.ikarashi at fujitsu dot com

--- Comment #8 from Ikarashi ---
> The ClassAtom \w does not contain exactly one character, so I think it's a 
> syntax error.

If you process '\w', '-', and 'a' in this order,
can the first \w be a ClassAtom anyway?
According to the definition of Atom,
it seems to be counted as a "\ AtomEscape" before the beginning of a
CharacterClass.

Atom[U, N] ::
  PatternCharacter
  .
  \ AtomEscape[?U, ?N]
  CharacterClass[?U]
  ( GroupSpecifier[?U] Disjunction[?U, ?N] )
  ( ? : Disjunction[?U, ?N] )

A \w is a "\ CharacterClassEscape", so it can be a "\ AtomEscape".
I know it can also be a "\ ClassEscape" and a ClassAtomNoDash;
however, \w-a looks like two Atoms to me: one Atom \w and one Atom -a.

Is there any rule defining the order of such interpretations?

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-10-02 Thread rs2740 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #7 from TC ---
(In reply to Jonathan Wakely from comment #6)
> I have looked in detail (I have the 3rd, 4th and 5th editions here) but my
> brain started oozing out of my ears.
> 
> 15.10.2.15 NonemptyClassRanges and 15.10.2.16 NonemptyClassRangesNoDash are
> the relevant sections of the 1999 3rd edition. The former defines:
> 
>   The internal helper function CharacterRange takes two CharSet parameters
>   A and B and performs the following:
>   1. If A does not contain exactly one character or B does not contain
>   exactly one character then throw a SyntaxError exception.
> 
> And the latter has this note:
> 
>   Informative comments: ClassRanges can expand into single ClassAtoms and/or
>   ranges of two ClassAtoms separated by dashes. In the latter case the
>   ClassRanges includes all characters between the first ClassAtom and the
>   second ClassAtom, inclusive; an error occurs if either ClassAtom does not
>   represent a single character (for example, if one is \w) or if the first
>   ClassAtom's code point value is greater than the second ClassAtom's code
>   point value.
> 
> 
> 
> The ClassAtom \w does not contain exactly one character, so I think it's a
> syntax error.
> 
> The 3rd edition doesn't mention any legacy features of RegExp, but it does
> seem to require the strict behaviour.

I've looked at the 1999 spec now, and agree with your reading.

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-10-02 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #6 from Jonathan Wakely ---
I have looked in detail (I have the 3rd, 4th and 5th editions here) but my
brain started oozing out of my ears.

15.10.2.15 NonemptyClassRanges and 15.10.2.16 NonemptyClassRangesNoDash are the
relevant sections of the 1999 3rd edition. The former defines:

  The internal helper function CharacterRange takes two CharSet parameters
  A and B and performs the following:
  1. If A does not contain exactly one character or B does not contain exactly
  one character then throw a SyntaxError exception.

And the latter has this note:

  Informative comments: ClassRanges can expand into single ClassAtoms and/or
  ranges of two ClassAtoms separated by dashes. In the latter case the
  ClassRanges includes all characters between the first ClassAtom and the
  second ClassAtom, inclusive; an error occurs if either ClassAtom does not
  represent a single character (for example, if one is \w) or if the first
  ClassAtom's code point value is greater than the second ClassAtom's code
  point value.



The ClassAtom \w does not contain exactly one character, so I think it's a
syntax error.

The 3rd edition doesn't mention any legacy features of RegExp, but it does seem
to require the strict behaviour.
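
The CharacterRange helper quoted above can be sketched as follows. This is only an illustration: the CharSet alias, the SyntaxError type and the function name are assumptions made for the example, and the ordering check follows the informative comment quoted above rather than the full numbered steps of the specification.

```cpp
#include <set>
#include <stdexcept>

using CharSet = std::set<char32_t>;

// Stand-in for the specification's SyntaxError exception; the type name is
// an assumption for this sketch.
struct SyntaxError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Build the set of characters from the single member of `a` to the single
// member of `b`, inclusive. Throws if either set is not exactly one
// character, or if the endpoints are out of order.
CharSet character_range(const CharSet& a, const CharSet& b)
{
  if (a.size() != 1 || b.size() != 1)
    throw SyntaxError("range endpoint is not a single character");
  const char32_t lo = *a.begin();
  const char32_t hi = *b.begin();
  if (lo > hi)
    throw SyntaxError("range endpoints are out of order");
  CharSet result;
  for (char32_t c = lo; ; ++c) {
    result.insert(c);
    if (c == hi)   // compare before incrementing to avoid wrap-around
      break;
  }
  return result;
}
```

Under this model \w is a CharSet with many members, so using it as either endpoint fails the first check.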

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-10-01 Thread rs2740 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

TC changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rs2740 at gmail dot com

--- Comment #5 from TC ---
Hmm, but C++'s normative reference is to a 1999 version of ECMAScript...which
might well have the "legacy" behavior? (I haven't looked at it in detail.)

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-10-01 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|INVALID                     |---
             Status|RESOLVED                    |NEW

--- Comment #4 from Jonathan Wakely ---
Reopening. JavaScript engines in web browsers accept invalid regexes for legacy
support:
https://262.ecma-international.org/#sec-additional-ecmascript-features-for-web-browsers

If we're not implementing a browser engine, then it should be a syntax error:

 NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges

It is a Syntax Error if IsCharacterClass of the first ClassAtom is true or
IsCharacterClass of the second ClassAtom is true.

It is a Syntax Error if IsCharacterClass of the first ClassAtom is false
and IsCharacterClass of the second ClassAtom is false and the CharacterValue of
the first ClassAtom is larger than the CharacterValue of the second ClassAtom.
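
The two early-error rules quoted above can be written as a small predicate. The ClassAtom struct below is a hypothetical model invented for this sketch (an atom with no single value stands for a character class such as \w); it is not a type from the specification or from libstdc++.

```cpp
#include <optional>

// Hypothetical model of a ClassAtom: either a character class such as \w
// (no single value) or a single character with a CharacterValue.
struct ClassAtom {
  std::optional<char32_t> value;  // empty means IsCharacterClass is true

  bool is_character_class() const { return !value.has_value(); }
};

// Returns true if "first - second" is a Syntax Error under the two rules
// quoted above.
bool range_is_syntax_error(const ClassAtom& first, const ClassAtom& second)
{
  // Rule 1: either endpoint is a character class.
  if (first.is_character_class() || second.is_character_class())
    return true;
  // Rule 2: both are single characters, but the first value is larger.
  return *first.value > *second.value;
}
```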

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-09-27 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |INVALID

--- Comment #3 from Jonathan Wakely ---
Not a bug, this is the expected behaviour for ECMAScript regular expressions.

Using one of the POSIX syntax options will cause regex_error to be thrown.
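
One way to compare the syntax options is to try the same pattern under each grammar flag. The helper accepted_by_grammar below is hypothetical, and which grammars accept "[\\w-a]" depends on the library version, so the sketch only reports results rather than asserting them.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: for each std::regex grammar, record whether the
// given pattern compiles without throwing std::regex_error.
std::vector<std::pair<std::string, bool>>
accepted_by_grammar(const std::string& pattern)
{
  const std::pair<const char*, std::regex::flag_type> grammars[] = {
    {"ECMAScript", std::regex::ECMAScript},
    {"basic",      std::regex::basic},
    {"extended",   std::regex::extended},
    {"awk",        std::regex::awk},
    {"grep",       std::regex::grep},
    {"egrep",      std::regex::egrep},
  };

  std::vector<std::pair<std::string, bool>> result;
  for (const auto& [name, flag] : grammars) {
    bool ok = true;
    try {
      std::regex re(pattern, flag);
    } catch (const std::regex_error&) {
      ok = false;
    }
    result.emplace_back(name, ok);
  }
  return result;
}
```

Printing the result of accepted_by_grammar("[\\w-a]") shows which of the six grammars accept the pattern in the library under test.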

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-09-24 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

--- Comment #2 from Jonathan Wakely ---
Actually, this might not be a bug.

We have this comment in regex_compiler.tcc

  // POSIX doesn't allow '-' as a start-range char (say [a-z--0]),
  // except when the '-' is the first or last character in the bracket
  // expression ([--0]). ECMAScript treats all '-' after a range as a
  // normal character. Also see above, where _M_expression_term gets
  // called.
  //
  // As a result, POSIX rejects [-], but ECMAScript doesn't.
  // Boost (1.57.0) always uses POSIX style even in its ECMAScript syntax.
  // Clang (3.5) always uses ECMAScript style even in its POSIX syntax.
  //
  // It turns out that no one reads BNFs ;)


So [\w-a] is valid for the ECMAScript syntax and is equivalent to POSIX
[-_[:alnum:]].

You can confirm this using your browser's JavaScript console, where this will
print true:

RegExp('[\\w-a]').test('-')
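
For comparison, the POSIX form mentioned above can be exercised from C++. This sketch assumes the extended grammar and uses a hypothetical function name; it checks only the POSIX bracket expression [-_[:alnum:]], not the disputed [\w-a] pattern.

```cpp
#include <regex>
#include <string>

// Match a whole string against the POSIX bracket expression from the
// comment above, compiled with the extended grammar. The function name is
// an assumption for this sketch.
bool matches_dash_underscore_alnum(const std::string& s)
{
  static const std::regex re("[-_[:alnum:]]", std::regex::extended);
  return std::regex_match(s, re);
}
```

The expression matches exactly one character that is a dash, an underscore or alphanumeric, which is the set the legacy browser interpretation assigns to [\w-a].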

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-09-24 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Jonathan Wakely changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |redi at gcc dot gnu.org

[Bug libstdc++/102447] std::regex incorrectly accepts invalid bracket expression

2021-09-24 Thread mpolacek at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102447

Marek Polacek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mpolacek at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-09-24

--- Comment #1 from Marek Polacek ---
Confirmed.