2018-05-22 13:49:20 +0100, Stephane CHAZELAS:
[...]
> In the case of the fnmatch and regexp of most systems, I don't
> know how they make so that [0-9] only matches on 0123456789 or
> [a-z] not on uppercase letters. Possibly, that's with special
> cases as well.
[...]

Sorry, my bad. It looks like I was basing my conclusions on
tests I thought I remembered doing but probably never did.

[0-9] matches on characters other than 0123456789 on many
systems with grep and system regexps as well.

On Solaris 10, in an en_GB.UTF-8 locale, with /usr/xpg4/bin/grep,
[0-9] matches on hundreds of different characters, many of which
have nothing to do with digits or are not even assigned in
Unicode. Its [a-z] matches on ABC...WXY and hundreds more, and
even on parts of characters, like the 0xf0..0xf4 lead bytes of
characters U+10000 to U+10FFFF.

On FreeBSD, [0-9] matches on U+2185 ROMAN NUMERAL SIX LATE FORM
in addition to 0123456789 (!?).

In GNU locales, whether [a-z] matches BCD..WXY or not depends on
the locale and the version of glibc. [0-9] does not always
include only 0123456789 either. For instance, in a th_TH.UTF-8
locale, grep '[a-z]' matches on M and grep '[0-9]' matches on

U+0E50 THAI DIGIT ZERO
U+0E51 THAI DIGIT ONE
U+0E52 THAI DIGIT TWO
U+0E53 THAI DIGIT THREE
U+0E54 THAI DIGIT FOUR
U+0E55 THAI DIGIT FIVE
U+0E56 THAI DIGIT SIX
U+0E57 THAI DIGIT SEVEN
U+0E58 THAI DIGIT EIGHT

(note the missing DIGIT NINE which would sort after 9).
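
For anyone wanting to repeat that kind of test, here is a minimal
sketch. The probe_range function name is made up for illustration,
and the th_TH.UTF-8 locale is assumed to be installed (if it is not,
grep typically falls back to the C locale and the Thai digit will
not match). Results will differ between systems and libc versions,
which is the whole point.

```shell
# Hypothetical helper: read candidate strings on stdin, one per line,
# and report whether grep '[0-9]' matches each one in the locale
# given as first argument.
probe_range() {
  locale_name=$1
  while IFS= read -r candidate; do
    if printf '%s\n' "$candidate" | LC_ALL=$locale_name grep -q '[0-9]'; then
      printf 'match    %s\n' "$candidate"
    else
      printf 'no match %s\n' "$candidate"
    fi
  done
}

# Candidates: ASCII 5, U+0E55 THAI DIGIT FIVE (UTF-8 bytes E0 B9 95,
# written as octal escapes), and the letter x.
printf '5\n\340\271\225\nx\n' | probe_range th_TH.UTF-8
```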

So that confirms that it's not only a bash/ksh93 "issue": [0-9]
cannot be used to match 0123456789 only, and what it matches
instead is unpredictable, useless, and not what one would ever
want.

[a-z] is not guaranteed to match on lower case letters only, let
alone on abcdefghijklmnopqrstuvwxyz only; it may even match on
characters outside the Latin script.

LC_ALL=C grep '[0-9]'

would be OK, except on text encoded in charsets where some
multibyte characters contain the byte values of ASCII digits
(like GB18030, BIG5...), since in the C locale the match is done
byte by byte.
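
To illustrate that caveat, here is a small sketch (the
count_digit_lines name is made up). In GB18030, U+0080 is encoded
as the four bytes 0x81 0x30 0x81 0x30, and 0x30 is also the
encoding of the ASCII digit 0, so a byte-wise match in the C locale
reports a digit where there is none:

```shell
# Hypothetical helper: count input lines containing a byte in the
# range '0'..'9', with matching forced to the C locale.
count_digit_lines() {
  LC_ALL=C grep -c '[0-9]'
}

# GB18030 encoding of U+0080 (bytes 81 30 81 30, as octal escapes).
# It is not a digit, yet the line is counted because of the 0x30 bytes.
printf '\201\060\201\060\n' | count_digit_lines
```

That prints 1 even though the input contains no digit character.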

It was requested that [[:digit:]] match only on 0123456789.
While in practice that seems to be the case for things that use
the POSIX API, it's not always the case outside of it, where
[0-9] generally matches on 0123456789 only but [[:digit:]] can
match on all sorts of decimal digits, as in:

perl -Mopen=locale -ne 'print if /[[:digit:]]/'

So it would seem that [0123456789] is the only portable way to
match on 0123456789 only.
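
As a rough sketch of how that can be put to use in sh (the
is_ascii_number name is only for illustration), the explicit list
works the same way in a shell case pattern:

```shell
# Hypothetical helper: succeed only if the argument is a non-empty
# string made exclusively of the ten ASCII digits. The digits are
# listed one by one, so no locale-dependent range is involved.
is_ascii_number() {
  case $1 in
    '' | *[!0123456789]*) return 1 ;;
    *) return 0 ;;
  esac
}

is_ascii_number 2018 && echo "2018: ASCII digits only"
is_ascii_number 20x8 || echo "20x8: rejected"
```

The same bracket expression can be given to grep, as in
grep '[0123456789]', when filtering lines instead of testing a
single string.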

-- 
Stephane