documentation bug re character range expressions

Marcel (Felix) Giannelia Thu, 02 Jun 2011 18:12:53 -0700

Hello,

I realize the issue of character range expressions not working as expected (because of locale settings) has been done to death, but I thought I should point this out.


The bash man page says:

"A pair of characters separated by a hyphen denotes a range expression; any character that ***sorts between those two characters,*** inclusive, using the current locale's collating sequence and character set, is matched." (emphasis mine)

That is incorrect because, for instance, an uppercase 'C' sorts between lowercase 'a' and lowercase 'c' (sometimes), as in this example (locale is en_GB.UTF-8):


$ touch aa B cd C
$ ls -1
aa
B
C
cd

However, bash's behaviour does not reflect what the man page says. Observe:

$ touch aa B cd C
$ ls -1 [a-c]*
aa
B
cd

Now, I'm firmly of the opinion that character range expressions paying any attention at all to the locale collation settings, in any shape or form, is completely broken behaviour. I really wish that [a-c] meant [abc] and not [aAbBc].
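For contrast, here is a minimal sketch of the explicit form [abc] applied to the same four file names. It assumes a POSIX shell with mktemp available; the variable names workdir and explicit_matches are made up for illustration. An explicit bracket list names its members one by one, so collation order plays no part in what it matches.

```shell
#!/bin/sh
# Sketch only: workdir and explicit_matches are illustrative names.
workdir=$(mktemp -d)
touch "$workdir/aa" "$workdir/B" "$workdir/cd" "$workdir/C"

# Expand the glob in a subshell so the caller's directory is untouched.
# [abc] lists its members explicitly; no range is involved.
explicit_matches=$(cd "$workdir" && printf '%s\n' [abc]*)

printf '%s\n' "$explicit_matches"
rm -rf "$workdir"
```

Only aa and cd should be listed; B and C are left out because neither letter appears in the list.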

But, it looks as if that's not going to change, so it is my hope that the documentation will at least be updated to reflect what really happens.

Previous posters who've complained about this character range issue have been directed to some comments made by Ulrich Drepper (who, I understand, is a maintainer of some underlying code that bash uses in its evaluation of range expressions?). Those comments include this:

"The strcoll result has nothing whatsoever to do with the range match. strcoll uses collation weights, ranges use collation sequence values, completely different concept."

I believe that same confusion is behind the problem in that paragraph from the man page and has led to the inappropriate use of the phrase "sorts between." The bit of man page text I quoted above should read:

"A pair of characters separated by a hyphen denotes a range expression; any character that ***occurs between those two characters in collation sequence value,*** inclusive, using the current locale's collating sequence and character set, is matched."

I believe it would also be helpful for the documentation to then go on to say something like this:

"This means that character ranges are neither case-sensitive nor case-insensitive in most locales. For instance (in the en_ locales), the range [a-c] is equivalent to [aAbBc] (note the absence of uppercase 'C'!). Thus, sub-ranges of the character class [[:alpha:]] must be used with great care, and probably should not be used at all, in locales other than C. It is not possible, for example, to specify a range of more than one but fewer than 26 lowercase letters in the en_US.UTF-8 locale. If you desire to match [abcdefghij] in this locale, you must not use a range, but specify all of those characters explicitly, or use LC_COLLATE from the C locale."

In closing, it is my fervent hope that the insanity of that last paragraph will be recognized (when is [a-c] being equivalent to [aAbBc] ever useful?!), and that this will eventually lead to character ranges becoming useful again regardless of the current locale.

But in the meantime, I would settle for a documentation change, and will continue to "export LC_COLLATE=C"! :)
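The LC_COLLATE=C workaround can be sketched roughly as follows, under a few assumptions: a POSIX shell, mktemp available, and illustrative variable names (workdir, c_locale_matches). Note that LC_ALL, when set, takes precedence over LC_COLLATE, so the sketch unsets it first inside a subshell to avoid touching the caller's environment.

```shell
#!/bin/sh
# Sketch only: workdir and c_locale_matches are illustrative names.
workdir=$(mktemp -d)
touch "$workdir/aa" "$workdir/B" "$workdir/cd" "$workdir/C"

c_locale_matches=$(
    cd "$workdir" || exit 1
    # LC_ALL overrides LC_COLLATE when set, so clear it in this subshell.
    unset LC_ALL
    LC_COLLATE=C
    export LC_COLLATE
    # Under the C collating sequence, [a-c] covers only a, b and c.
    printf '%s\n' [a-c]*
)

printf '%s\n' "$c_locale_matches"
rm -rf "$workdir"
```

With the C collating sequence in effect, the range should list only aa and cd, i.e. [a-c] behaves like [abc].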


~Felix.
