Issue | 154408
Summary | [libc++] <regex>: Unmatched backrefs should always succeed in ECMAScript mode.
Labels | libc++
Assignees |
Reporter | SainoNamkho

ECMA-262 (1999), 15.10.2.9 states:
> An escape sequence of the form `\` followed by a nonzero decimal number $n$
> matches the result of the nth set of capturing parentheses (see 15.10.2.11). It is an error if the regular
> expression has fewer than $n$ capturing parentheses. If the regular expression has $n$ or more capturing
> parentheses but the nth one is **undefined** because it hasn't captured anything, then the backreference
> always succeeds.

So this JavaScript code
``` js
for (let pattern of [/(\1)a/, /\1(a)/, /(a()|)\2a/, /(b()|)\2a/, /(b()|)\2/])
  if (pattern.test('a'))
    console.log(`${pattern} matches "a".`)
  else
    console.log(`${pattern} does not match "a".`)
```
prints
```
/(\1)a/ matches "a".
/\1(a)/ matches "a".
/(a()|)\2a/ matches "a".
/(b()|)\2a/ matches "a".
/(b()|)\2/ matches "a".
```
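
The rule from 15.10.2.9 can be sketched as a single matching step. This is only an illustration, assuming C++17; `match_backref` is a hypothetical helper and not part of libc++ or any other standard library implementation. It models an undefined capture as `std::nullopt`, which is distinct from a capture that is defined but empty.

``` c++
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (not libc++ code): one backreference step at `pos`.
// `capture` is std::nullopt when the group has not captured anything.
// Returns the position after the step on success, std::nullopt on failure.
std::optional<std::size_t> match_backref(std::string_view input,
                                         std::size_t pos,
                                         std::optional<std::string_view> capture)
{
    // Guard against positions past the end so substr() cannot throw.
    if (pos > input.size())
        return std::nullopt;
    // ECMAScript rule: an undefined capture always succeeds, consuming nothing.
    if (!capture)
        return pos;
    // A defined capture (possibly empty) must appear literally at `pos`.
    if (input.substr(pos, capture->size()) == *capture)
        return pos + capture->size();
    return std::nullopt;
}
```

For `/(b()|)\2a/` against `a`, the second alternative leaves group 2 undefined, so the step for `\2` returns position 0 unchanged and the trailing `a` can then match.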
C++ implementations diverge (https://godbolt.org/z/r51f4Wsas):
``` c++
#include <regex>
#include <print>

int main()
{
    for (auto pattern : {
        "(\\1)a", "\\1(a)",
        "(a()|)\\2a", "(b()|)\\2",
    }) {
        try {
            if (std::regex_search("a", std::regex{pattern}))
                std::println("\x1b[32m/{}/ matches.\x1b[m", pattern);
            else
                std::println("\x1b[31m/{}/ does not match.\x1b[m", pattern);
        }
        catch (const std::regex_error& e) {
            if (e.code() == std::regex_constants::error_backref)
                std::println("\x1b[34m/{}/: {}\x1b[m", pattern, e.what());
            else
                throw;
        }
    }
}
```
libstdc++ and MSVC STL reject `(\1)a` and `\1(a)` as invalid regexes, while libc++ accepts `(\1)a`. I believe there are no invalid backrefs in the ECMAScript flavor, but that can be discussed in another issue.
What I'm interested in here is that `(a()|)\2a` and `(b()|)\2` should match `a`.
[A POC fix](https://godbolt.org/z/x6xnKjx8f)