Hello,
I realize the issue of character range expressions not working as
expected (because of locale settings) has been done to death, but I
thought I should point this out.
The bash man page says:
"A pair of characters separated by a hyphen denotes a range expression;
any character that ***sorts between those two characters,*** inclusive,
using the current locale's collating sequence and character set, is
matched." (emphasis mine)
That is incorrect because, for instance, an uppercase 'C' sorts between
lowercase 'a' and lowercase 'c' (sometimes), as in this example (locale
is en_GB.UTF-8):
$ touch aa B cd C
$ ls -1
aa
B
C
cd
However, bash's behaviour does not reflect what the man page says. If
ranges really followed sort order, [a-c]* would also match C here, but
it does not. Observe:
$ touch aa B cd C
$ ls -1 [a-c]*
aa
B
cd
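To see exactly which letters a range picks up under the current
settings, without creating any files, something like the following
sketch can help. The helper name range_members is made up for this
example, and I have left the output out because it depends on the
locale and shell version in use.

```shell
# range_members is a hypothetical helper, not a bash builtin.
# It prints, on one line, each ASCII letter matched by the bracket
# expression passed as $1 under the collation settings in effect.
# The body is a subshell (parentheses) so its variables do not leak.
range_members() (
    pattern=$1
    out=
    for ch in a b c d e f g h i j k l m n o p q r s t u v w x y z \
              A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
        # An unquoted expansion in a case pattern is treated as a glob pattern.
        case $ch in
            $pattern) out=$out$ch ;;
        esac
    done
    printf '%s\n' "$out"
)

range_members '[a-c]'
```

With LC_ALL=C set beforehand this prints abc; in other locales the
result shows which extra letters, if any, the range has pulled in.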
Now, I'm firmly of the opinion that character range expressions paying
any attention at all to the locale collation settings, in any shape or
form, is completely broken behaviour. I really wish that [a-c] meant
[abc] and not [aAbBc].
But, it looks as if that's not going to change, so it is my hope that
the documentation will at least be updated to reflect what really happens.
Previous posters who've complained about this character range issue have
been directed to some comments made by Ulrich Drepper (who, I
understand, is a maintainer of some underlying code that bash uses in
its evaluation of range expressions?). Those comments include this:
"The strcoll result has nothing whatsoever to do with the range match.
strcoll uses collation weights, ranges use collation sequence values,
completely different concept."
I believe that same confusion is behind the problem in that paragraph
from the man page and has led to the inappropriate use of the phrase
"sorts between." The bit of man page text I quoted above should read:
"A pair of characters separated by a hyphen denotes a range expression;
any character that ***occurs between those two characters in collation
sequence value,*** inclusive, using the current locale's collating
sequence and character set, is matched."
I believe it would also be helpful for the documentation to then go on
to say something like this:
"This means that character ranges are neither case-sensitive nor
case-insensitive in most locales. For instance (in the en_* locales), the
range [a-c] is equivalent to [aAbBc] (note the absence of uppercase
'C'!). Thus, sub-ranges of the character class [[:alpha:]] must be used
with great care, and probably should not be used at all, in locales
other than C. It is not possible, for example, to write a range that
matches more than one but fewer than 26 lowercase letters, and nothing
else, in the en_US.UTF-8 locale. If you desire to match [abcdefghij] in
this locale, you must not
use a range, but specify all of those characters explicitly, or use
LC_COLLATE from the C locale."
In closing, it is my fervent hope that the insanity of that last
paragraph will be recognized (when is [a-c] being equivalent to [aAbBc]
ever useful?!), and that this will eventually lead to character ranges
becoming useful again regardless of the current locale.
But in the meantime, I would settle for a documentation change, and
will continue to "export LC_COLLATE=C"! :)
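For anyone who wants to try that workaround, here is a minimal sketch
using the same four file names as above. The function name
demo_c_collation is my own invention, and it assumes mktemp -d is
available. It clears LC_ALL first because LC_ALL, when set, takes
precedence over LC_COLLATE.

```shell
# demo_c_collation is a hypothetical name for this sketch.
# The body is a subshell (parentheses), so the cd and the locale
# changes do not leak into the calling shell.
demo_c_collation() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch aa B cd C

    # LC_ALL, when set, overrides LC_COLLATE, so clear it first.
    unset LC_ALL
    export LC_COLLATE=C

    # Print each match on its own line, as ls -1 would.
    printf '%s\n' [a-c]*

    # Clean up the scratch directory.
    cd / && rm -r "$dir"
)

demo_c_collation
```

With C collation in effect, only aa and cd are printed; B is no longer
matched.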
~Felix.