On Tue, May 20, 2025 at 19:08:10 +0000, FunnyMan Computer wrote:
>         I failed multiple times on getting similar results to what I was
>         expecting from using grep just using the [a-z] and [a-z]+ classes -
>         expecting multiple results from $BASH_REMATCH but it's only picking
>         up 1 character at most, while grep -E is able to pick up all the
>         characters (which is weird, since the class [a-z]+$ gives completely
>         similar results).

My first reaction: [a-z] is dangerous.  It matches *some* definition
of "a character between lowercase a and lowercase z, inclusive", but
depending on your locale, this may or may not be equivalent to
"the set of lowercase letters in the ASCII character set".

If you want [a-z] to work like ASCII does, you'll need the C locale:
LC_COLLATE=C governs ranges, and LC_ALL=C overrides everything.
If you want to match lowercase letters in your current locale, you
should use [[:lower:]] instead.
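
A small sketch of the difference, assuming bash and a working C locale.
The check function is a made-up helper for this example only.  Outside
the C locale the first answer may differ, which is the whole problem.

```bash
#!/usr/bin/env bash
# Assumption: forcing LC_ALL=C makes [a-z] mean exactly the 26 ASCII
# lowercase letters.  In other locales the range may contain more.
LC_ALL=C

range_re='^[a-z]$'
class_re='^[[:lower:]]$'

# Hypothetical helper: prints "yes" if string $1 matches regex $2.
check() {
    local re=$2
    if [[ $1 =~ $re ]]; then echo yes; else echo no; fi
}

upper_in_range=$(check B "$range_re")   # no:  B is outside a..z in C
lower_in_range=$(check b "$range_re")   # yes
lower_in_class=$(check b "$class_re")   # yes

echo "B in [a-z]:       $upper_in_range"
echo "b in [a-z]:       $lower_in_range"
echo "b in [[:lower:]]: $lower_in_class"
```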

Moving along....

> So, I was wondering whether this was a bug or intended and I'm just
> misinterpreting how bash does regular expressions. I tried reading the bash
> manual on the '=~' operator,

> Repeat-By:
>       grep:
>             `$ echo test-test | POSIXLY_CORRECT=1 grep -E [a-z]`
>             `^test^-^test^`
> 
>             `$ echo test-tesst | POSIXLY_CORRECT=1 grep -E [a-z]+`
>             `^test^-^tesst^`

Second reaction: you forgot to quote the regex.  Unquoted, [a-z] is also
a glob: if it matches a file in the current working directory, the shell
replaces it with that file name before grep ever sees it.
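
To see the quoting problem in isolation, here is a sketch that runs in
a scratch directory holding a single file named "a".  The directory and
the variable names are assumptions made up for the demonstration.

```bash
#!/usr/bin/env bash
# Scratch directory with one file whose name matches the glob [a-z].
scratch=$(mktemp -d) || exit 1
touch "$scratch/a"

# Unquoted: filename expansion replaces [a-z] with the matching file.
unquoted=$(cd "$scratch" && echo [a-z])
# Quoted: the pattern reaches the command unchanged.
quoted=$(cd "$scratch" && echo '[a-z]')

rm -rf "$scratch"
printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
```

With grep, the unquoted form would have searched for the regex "a"
instead of [a-z].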

Assuming that there are no matching files in your current directory....

In both examples, you have a single line of input, and the line happens
to match the regex you gave.  So, grep prints that line.

>         bash's '=~' and $BASH_REMATCH:
>             ```
>             $ if [[ test-test =~ [a-z] ]]; then
>                 for i in "${!BASH_REMATCH[@]}"; do
>                     echo "$i: ${BASH_REMATCH[$i]}";
>                 done
>             fi
>             ```
>             `0: t`

You can use "declare -p BASH_REMATCH" to show the array more easily.

Now, repeating what appears to be your bug report:

>         expecting multiple results from $BASH_REMATCH but it's only picking
>         up 1 character at most, while grep -E is able to pick up all the
>         characters (which is weird, since the class [a-z]+$ gives completely
>         similar results).

grep always prints whole lines.  It doesn't just print the matching part
of a line.
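
If what you want from grep is only the matching parts, one option is
the -o flag.  This is a sketch that assumes GNU or BSD grep; -o is an
extension and is not part of POSIX grep.

```bash
#!/usr/bin/env bash
# -o prints each matching part on its own line instead of the whole line.
# LC_ALL=C keeps [a-z] limited to ASCII lowercase letters.
pieces=$(printf '%s\n' test-tesst | LC_ALL=C grep -oE '[a-z]+')
printf '%s\n' "$pieces"
```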

BASH_REMATCH stores matching parts of the input string.  Index 0 stores
the whole matching substring, and indexes 1+ store pieces that match
parenthesized sub-expressions (which you're not currently using).
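
For illustration, here is a sketch with two parenthesized
sub-expressions.  The regex and the variable names are my own choices
for the example, not something from your report.

```bash
#!/usr/bin/env bash
# Keep the regex in a variable so the shell does not try to parse it.
re='([[:lower:]]+)-([[:lower:]]+)'

if [[ test-ing =~ $re ]]; then
    whole=${BASH_REMATCH[0]}    # entire matched substring
    left=${BASH_REMATCH[1]}     # first parenthesized group
    right=${BASH_REMATCH[2]}    # second parenthesized group
    printf '0: %s\n1: %s\n2: %s\n' "$whole" "$left" "$right"
fi
```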

hobbit:~$ echo "$BASH_VERSION"
5.2.15(1)-release
hobbit:~$ [[ test-test =~ [[:lower:]] ]] ; declare -p BASH_REMATCH
declare -a BASH_REMATCH=([0]="t")

In the above example, [[:lower:]] matches a single lowercase letter
in my locale, and the first such letter is 't'.  So, that's what
gets matched and stored in BASH_REMATCH[0].

hobbit:~$ [[ test-test =~ [[:lower:]]+ ]] ; declare -p BASH_REMATCH
declare -a BASH_REMATCH=([0]="test")

In the above example, [[:lower:]]+ matches a sequence of one or more
lowercase letters in my locale.  POSIX regular expressions, which =~
uses, take the longest match at the leftmost position, so it will match
as many letters as possible.  In this case, the string "test" is matched
and stored.

hobbit:~$ [[ test-ing =~ [[:lower:]]+$ ]] ; declare -p BASH_REMATCH
declare -a BASH_REMATCH=([0]="ing")

In the above example, I'm using [[:lower:]]+$ because you mentioned
wanting to use [a-z]+$ earlier.  I also changed the input string so
that we can see whether it's matching the left hand side or the right
hand side of the input.  In this case, it matches the right hand side.
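
Since a single =~ test only ever records one match, collecting every
run of lowercase letters needs a loop.  The following is one possible
sketch, not the only way to do it: it matches, saves BASH_REMATCH[0],
and then chops the input up to and including that match.  The variable
names are assumptions for the example.

```bash
#!/usr/bin/env bash
s='test-tesst'
re='[[:lower:]]+'
matches=()

while [[ $s =~ $re ]]; do
    matches+=("${BASH_REMATCH[0]}")
    # Drop everything through the first occurrence of the matched text.
    # The quotes make the matched text literal inside the pattern.
    s=${s#*"${BASH_REMATCH[0]}"}
done

declare -p matches
```

This works for a simple regex like this one, where the first occurrence
of the matched text is the match itself.  Anchored regexes such as
[[:lower:]]+$ would need more care.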
