Re: Regex: A case where the longest match isn't being found

2023-10-27 Thread Chet Ramey

On 10/27/23 1:04 AM, Lawrence Velázquez wrote:
> On Fri, Oct 27, 2023, at 12:25 AM, Grisha Levit wrote:
>> On Thu, Oct 26, 2023, 20:30 Dale R. Worley  wrote:
>>
>>> I suspect the difference between the versions is how the regexp is
>>> unquoted while it is being read, with version 3 interpreting [^\'] as
>>> "character class excluding newline, backslash, and quote" and version 5
>>> interpreting it as "character class excluding newline and quote".
>>
>> That seems right
>
> How?  If it were right, wouldn't these produce different results?

It's not. bash-current and bash-3.2 pass the same string ("\") and
pattern ("[^']") to sh_regmatch.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/




Re: Regex: A case where the longest match isn't being found

2023-10-26 Thread Lawrence Velázquez
On Fri, Oct 27, 2023, at 12:25 AM, Grisha Levit wrote:
> On Thu, Oct 26, 2023, 20:30 Dale R. Worley  wrote:
>
>> I suspect the difference between the versions is how the regexp is
>> unquoted while it is being read, with version 3 interpreting [^\'] as
>> "character class excluding newline, backslash, and quote" and version 5
>> interpreting it as "character class excluding newline and quote".
>>
>
> That seems right

How?  If it were right, wouldn't these produce different results?

% cat foo.bash
printf %s\\n "$BASH_VERSION"
if [[ \\ =~ [^\'] ]]
then
printf %s\\n "${BASH_REMATCH[0]}"
fi
% /bin/bash ./foo.bash
3.2.57(1)-release
\
% /opt/local/bin/bash ./foo.bash
5.2.15(1)-release
\

-- 
vq



Re: Regex: A case where the longest match isn't being found

2023-10-26 Thread Grisha Levit
On Thu, Oct 26, 2023, 20:30 Dale R. Worley  wrote:

> I suspect the difference between the versions is how the regexp is
> unquoted while it is being read, with version 3 interpreting [^\'] as
> "character class excluding newline, backslash, and quote" and version 5
> interpreting it as "character class excluding newline and quote".
>

That seems right, just note that Bash is matching without the REG_NEWLINE
flag -- [[ $'\n' =~ [^x] ]] is true.
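
A minimal sketch of that check, for anyone who wants to confirm it on their own system. The variable names are only illustrative; the anchors are added so the newline has to be the character that the bracket expression consumes.

```bash
#!/usr/bin/env bash
# Sketch: does a negated bracket expression match a newline in [[ =~ ]]?
nl=$'\n'

if [[ $nl =~ ^[^x]$ ]]; then
    newline_matched=yes   # expected when the regex is compiled without REG_NEWLINE
else
    newline_matched=no
fi
printf 'newline matched by [^x]: %s\n' "$newline_matched"
```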



Re: Regex: A case where the longest match isn't being found

2023-10-26 Thread Dan Bornstein
Thanks to the folks who replied.

Indeed, I misunderstood the "longest match" rule to apply to captures and not 
just the whole string. (That is, I thought an earlier capture would get "first 
dibs" on any matching text.) And, as was pointed out by Greg W, the exact 
behavior depends more on the regex library that Bash got linked with than 
anything that Bash inherently does.
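
To make that concrete, here is a small sketch using the expression from the report. Because the pattern is anchored with ^ and $, the whole match (BASH_REMATCH[0]) has to be the entire subject wherever the match succeeds; only the way the captures divide it up is left to the regex library. The variable names are just for illustration.

```bash
#!/usr/bin/env bash
subject=$'$\'foo\\\' x\' bar'      # the string: $'foo\' x' bar

if [[ $subject =~ ^(\$\'([^\']|\\\')*\')(.*)$ ]]; then
    whole=${BASH_REMATCH[0]}
    printf 'whole match: %s\n' "$whole"                # the full subject
    printf 'capture 1:   %s\n' "${BASH_REMATCH[1]}"    # may differ by regex library
    printf 'capture 3:   %s\n' "${BASH_REMATCH[3]}"    # may differ by regex library
fi
```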

Thanks again, and sorry for the noise.

-dan

On Thu, Oct 26, 2023, at 10:50 AM, Dan Bornstein wrote:
> Configuration Information [Automatically generated, do not change]:
> Machine: aarch64
> OS: linux-gnu
> Compiler: gcc
> Compilation CFLAGS: -O2 -ftree-vectorize -flto=auto -ffat-lto-objects -fexcepti\
> ons -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY\
> _SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-\
> cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -ma\
> rch=armv8.2-a+crypto -mtune=neoverse-n1 -mbranch-protection=standard -fasynchro\
> nous-unwind-tables -fstack-clash-protection
> uname output: Linux i-062640626b26bd9ed.us-west-2.compute.internal 6.1.25-37.47\
> .amzn2023.aarch64 #1 SMP Mon Apr 24 23:19:51 UTC 2023 aarch64 aarch64 aarch64 G\
> NU/Linux
> Machine Type: aarch64-amazon-linux-gnu
> 
> Bash Version: 5.2
> Patch Level: 15
> Release Status: release
> 
> Description:
> 
> I found a case where the regex evaluator doesn't seem to be finding the 
> longest possible match for a given expression. The expression works as 
> expected on an older version of Bash (3.2.57(1)-release 
> (arm64-apple-darwin22)).
> 
> Here's the regex: ^(\$\'([^\']|\\\')*\')(.*)$
> 
> (FWIW, this is meant to capture a string that looks like an ANSI-style 
> literal string, plus a "rest" for further processing.)
> 
> Repeat-By:
> 
> For example, run this:
> 
> [[ $'$\'foo\\\' x\' bar' =~ ^(\$\'([^\']|\\\')*\')(.*)$ ]] && echo "${BASH_REMATCH[1]}"
> 
> On v5.2, this prints: $'foo\'
> On v3.2.57, this prints: $'foo\' x'
> 
> 


Re: Regex: A case where the longest match isn't being found

2023-10-26 Thread Lawrence Velázquez
On Thu, Oct 26, 2023, at 7:01 PM, Greg Wooledge wrote:
> On Thu, Oct 26, 2023 at 10:50:13AM -0700, Dan Bornstein wrote:
>> I found a case where the regex evaluator doesn't seem to be finding the 
>> longest possible match for a given expression. The expression works as 
>> expected on an older version of Bash (3.2.57(1)-release 
>> (arm64-apple-darwin22)).
>
> Bash uses the system's (libc) version of regex(3), so the difference
> you're seeing is presumably caused by the apple-darwin22 part, rather
> than the bash 3.2 part.  (Or conversely, caused by the linux-gnu part
> rather than the bash 5.2 part.)

Indeed, on my macOS system *both* 3.2.57 and 5.2.15 output

$'foo\' x'

and on shbot (the Linux IRC bot used in #bash) *both* 3.2.48 and
5.2.9 output

$'foo\'
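
For anyone repeating this comparison, a sketch of a reproducer that prints the platform next to the shell version, so the two factors can be told apart in the output. OSTYPE and MACHTYPE are standard Bash variables; the capture line is left without an expected value because it depends on the regex library.

```bash
#!/usr/bin/env bash
# Print enough context to separate "which bash" from "which libc".
printf 'bash:     %s\n' "$BASH_VERSION"
printf 'platform: %s (%s)\n' "$OSTYPE" "$MACHTYPE"

if [[ $'$\'foo\\\' x\' bar' =~ ^(\$\'([^\']|\\\')*\')(.*)$ ]]; then
    capture=${BASH_REMATCH[1]}
    printf 'capture:  %s\n' "$capture"
fi
```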

-- 
vq



Re: Regex: A case where the longest match isn't being found

2023-10-26 Thread Dale R. Worley
"Dan Bornstein"  writes:
> I found a case where the regex evaluator doesn't seem to be finding
> the longest possible match for a given expression. The expression
> works as expected on an older version of Bash (3.2.57(1)-release
> (arm64-apple-darwin22)).
>
> Here's the regex: ^(\$\'([^\']|\\\')*\')(.*)$

This would be *much* easier to understand if you'd rewritten it into a
test case where none of the characters that the regex was consuming were
regex metacharacters, e.g. letters.  That would make everything far
easier to read.
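
As a sketch of what such a rewrite could look like, the version below substitutes letters for the troublesome characters. The mapping is my own choice and purely hypothetical: D stands for $, Q for ', and B for backslash. Which of the two possible values ends up in the first capture still depends on the regex library, so no expected output is given.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins: D for $, Q for ', B for backslash.
subject='DQfooBQ xQ bar'

if [[ $subject =~ ^(DQ([^Q]|BQ)*Q)(.*)$ ]]; then
    first=${BASH_REMATCH[1]}
    rest=${BASH_REMATCH[3]}
    printf 'first: %s\nrest:  %s\n' "$first" "$rest"
fi
```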

> (FWIW, this is meant to capture a string that looks like an ANSI-style
> literal string, plus a "rest" for further processing.)

*If* I read the regex correctly, it must match the entire string, ^...$.
Then it matches in two parenthesized strings.
The second string is .*, as you say, "the rest for further processing".
The first string is $, followed by a single ', followed by any number
of:
- one character that is not '
- the character \ followed by '
followed by a single '

Note that this regex does not give the matcher any leeway; it either
matches a string or not, and if it matches, it matches in only one way.
This isn't a question of whether it is choosing "the longest match".

*If* I read the subject string correctly, it is

$'foo\' x' bar

(with two internal spaces and none at the beginning or end).

So it does seem to say that the first parenthesized string should match

$'foo\' x'

> For example, run this:
>
> [[ $'$\'foo\\\' x\' bar' =~ ^(\$\'([^\']|\\\')*\')(.*)$ ]] && echo "${BASH_REMATCH[1]}"
>
> On v5.2, this prints: $'foo\'
> On v3.2.57, this prints: $'foo\' x'

Executing this suggests that the subject string is being interpreted as
intended:

$ echo -E $'$\'foo\\\' x\' bar'
$'foo\' x' bar
$

OK, here's the problem.  Compare these executions; the last two have an
additional \\ inserted in the character class specification [...]:

$ [[ $'$\'foo\\\' x\' bar' =~ ^(\$\'([^\']|\\\')*\')(.*)$ ]] && echo -E "${BASH_REMATCH[1]}"
$'foo\'
$ [[ $'$\'foo\\\' x\' bar' =~ ^(\$\'([^\']|\\\')*\')(.*)$ ]] && echo -E "${BASH_REMATCH[3]}"
 x' bar
$ [[ $'$\'foo\\\' x\' bar' =~ ^(\$\'([^\\\']|\\\')*\')(.*)$ ]] && echo -E "${BASH_REMATCH[1]}"
$'foo\' x'
$ [[ $'$\'foo\\\' x\' bar' =~ ^(\$\'([^\\\']|\\\')*\')(.*)$ ]] && echo -E "${BASH_REMATCH[3]}"
 bar
$

What you wrote was [^\'].  The backslash is consumed while the regexp is
being read and is interpreted as causing the succeeding ' to be
non-magic.  Of course, that ' isn't magic in that context.  But it means
that the character class includes all non-newline characters other than
'.  Specifically, that *includes* backslash.  So the regexp matcher sees
the first alternative of the | as matching an isolated backslash.

That means that when the matcher is processing the iterator *, at that
iteration, it will first match the character class, which matches, then
attempt to iterate further (which will fail, exiting the iterator),
match the closing ' (which will succeed), match the (.*) (which will
succeed), match the $ (which will succeed), then return that match.

Now, if the part of the pattern following the iterator failed, the
matcher would backtrack until it attempted the *second* branch of the
alternative in the final iteration, which would *also* match, but would
leave the matcher looking at " x'...", which would allow it to continue
iterating.
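
One way to see that both paths are legitimate matches is to pin down the tail of the pattern so that only one split can succeed. The sketch below is only an illustration: it builds the regex text in variables (q, bs, head and the rest are names I made up) so the shell quoting stays out of the way, then forces each split in turn.

```bash
#!/usr/bin/env bash
q="'"      # one single quote
bs='\'     # one backslash

subject="\$${q}foo${bs}${q} x${q} bar"      # $'foo\' x' bar
# The ambiguous prefix from the report, as regex text: ^(\$'([^']|\\')*')
head="^(\\\$${q}([^${q}]|${bs}${bs}${q})*${q})"

short_re="${head} x${q} bar\$"   # the rest must be:  x' bar
long_re="${head} bar\$"          # the rest must be:  bar

[[ $subject =~ $short_re ]] && short=${BASH_REMATCH[1]}
[[ $subject =~ $long_re ]] && long=${BASH_REMATCH[1]}

printf 'short split: %s\n' "$short"    # $'foo\'
printf 'long split:  %s\n' "$long"     # $'foo\' x'
```

Both patterns match the same subject, which is why the unpinned original leaves the choice to the matcher.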

The complication is that if you'd written [^\\\'], the matcher would
have only one way to match any subject string, and the matching process
could be ignored.  But with [^\'], there are multiple ways to match, and
you get the first one the matcher finds.  Specifically, ?, +, and * are
"greedy": they start by matching as many copies of their sub-pattern as
they are allowed, and if backtracked into, reduce the number of
iterations one at a time as far as they're allowed.  Alternations,
though, start by attempting the first alternative, then the second,
etc., and if backtracked into after succeeding with one alternative, try
the next one.  Ugh, that's a rough outline; it's not quite that simple.
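
A sketch of the unambiguous form, assuming the intent is "a $'...' literal in which \' does not end the string". The regex is kept in a variable and assembled from small pieces (q, bs and re are illustrative names) so that it is clear exactly which characters reach the regex library. With backslash excluded from the character class there is only one way to match, so the result should not depend on the matcher.

```bash
#!/usr/bin/env bash
q="'"      # one single quote
bs='\'     # one backslash

subject="\$${q}foo${bs}${q} x${q} bar"      # $'foo\' x' bar
# Regex text: ^(\$'([^\']|\\')*')(.*)$
re="^(\\\$${q}([^${bs}${q}]|${bs}${bs}${q})*${q})(.*)\$"

if [[ $subject =~ $re ]]; then
    literal=${BASH_REMATCH[1]}
    rest=${BASH_REMATCH[3]}
    printf 'literal: %s\n' "$literal"   # $'foo\' x'
    printf 'rest:    %s\n' "$rest"      # " bar", with its leading space
fi
```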

I suspect the difference between the versions is how the regexp is
unquoted while it is being read, with version 3 interpreting [^\'] as
"character class excluding newline, backslash, and quote" and version 5
interpreting it as "character class excluding newline and quote".

Dale



Re: Regex: A case where the longest match isn't being found

2023-10-26 Thread Greg Wooledge
On Thu, Oct 26, 2023 at 10:50:13AM -0700, Dan Bornstein wrote:
> I found a case where the regex evaluator doesn't seem to be finding the 
> longest possible match for a given expression. The expression works as 
> expected on an older version of Bash (3.2.57(1)-release 
> (arm64-apple-darwin22)).

Bash uses the system's (libc) version of regex(3), so the difference
you're seeing is presumably caused by the apple-darwin22 part, rather
than the bash 3.2 part.  (Or conversely, caused by the linux-gnu part
rather than the bash 5.2 part.)



Regex: A case where the longest match isn't being found

2023-10-26 Thread Dan Bornstein
Configuration Information [Automatically generated, do not change]:
Machine: aarch64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -O2 -ftree-vectorize -flto=auto -ffat-lto-objects -fexcepti\
ons -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY\
_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-\
cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -ma\
rch=armv8.2-a+crypto -mtune=neoverse-n1 -mbranch-protection=standard -fasynchro\
nous-unwind-tables -fstack-clash-protection
uname output: Linux i-062640626b26bd9ed.us-west-2.compute.internal 6.1.25-37.47\
.amzn2023.aarch64 #1 SMP Mon Apr 24 23:19:51 UTC 2023 aarch64 aarch64 aarch64 G\
NU/Linux
Machine Type: aarch64-amazon-linux-gnu

Bash Version: 5.2
Patch Level: 15
Release Status: release

Description:

I found a case where the regex evaluator doesn't seem to be finding the longest 
possible match for a given expression. The expression works as expected on an 
older version of Bash (3.2.57(1)-release (arm64-apple-darwin22)).

Here's the regex: ^(\$\'([^\']|\\\')*\')(.*)$

(FWIW, this is meant to capture a string that looks like an ANSI-style literal 
string, plus a "rest" for further processing.)

Repeat-By:

For example, run this:

[[ $'$\'foo\\\' x\' bar' =~ ^(\$\'([^\']|\\\')*\')(.*)$ ]] && echo "${BASH_REMATCH[1]}"

On v5.2, this prints: $'foo\'
On v3.2.57, this prints: $'foo\' x'