> -----Original Message-----
> From: Stephane Chazelas [mailto:stephane.chaze...@gmail.com]
> Sent: Sunday, May 20, 2018 10:43 PM
> To: Geoff Clare
> Cc: austin-group-l@opengroup.org
> Subject: Re: can [[:digit:]] match something other than 0123456789?
> 

> Note that having [x-y] be based on collation order would mean that things
> like [a-z] would also match on uppercase letters in the latin script in
> locales where case is not considered in the first weight for sorting (as is
> typical for English locales for instance).
> 
> 
> Now, in an en_GB.UTF-8 locale on GNU/Linux (here ubuntu 16.04) for instance,
> both bash's and ksh93's [0-9] matches on at least 142 different characters
> (see below). That matches on 0123456789 but also digits 0 (sometimes 1) to 8
> (sometimes 9 like for U+0669 which sorts the same as 9 there!) in other
> scripts, and some other random decimal digits, and some non-digits and is
> far from including all the plethora of other decimal digits in Unicode.
> (unicode --max 0 --regexp
> 'digit.(one|two|three|four|five|six|seven|eight|nine)\b' | grep -c '^U+'
> returns 696 with an old version of unicode, and that doesn't even include
> things like roman numerals).

I'd find [0-9] matching on just "western" digits and [[:digit:]] matching
on the locale's digits the most natural solution.  If someone wanted to match
on Devanagari or whatever digits, she could simply list them in the bracket
expression, rather than using "western" digits.  If [0-9] is understood to be
[[:digit:]], how could one differentiate between "western" and, say,
Devanagari digits (other than listing them each explicitly, [0123456789], as
Stephane has done)?
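
To illustrate the distinction, here is a rough sketch in Python.  Note the
assumptions: Python's re module is not a POSIX regex implementation and has no
[[:digit:]], so I use \d (any Unicode decimal digit) as a stand-in for "digits
in general"; the classify() helper and the pattern names are made up for this
example.

```python
import re

# Code point range U+0030..U+0039: only the ten "western" digits.
WESTERN = re.compile(r"[0-9]")
# For str patterns, \d matches any Unicode decimal digit (category Nd).
# This stands in for "digits in general"; it is NOT POSIX [[:digit:]].
ANY_DECIMAL = re.compile(r"\d")
# Devanagari digits zero through nine, listed as an explicit range.
DEVANAGARI = re.compile("[\u0966-\u096F]")


def classify(ch):
    """Report which of the three patterns match the single character ch."""
    return {
        "western": WESTERN.fullmatch(ch) is not None,
        "any_decimal": ANY_DECIMAL.fullmatch(ch) is not None,
        "devanagari": DEVANAGARI.fullmatch(ch) is not None,
    }
```

With [0-9] restricted to the ten code points, classify("7") and
classify("\u0969") (DEVANAGARI DIGIT THREE) give different answers for
"western" and "devanagari", which is exactly the differentiation that is lost
if [0-9] is taken to mean all digits.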

The same goes for [a-z]: it should match (indeed, should be) just the Roman
letters, not alphabetic characters in general.

Also, my feeling is that [[:digit:]] should match just the digits that are
actually relevant for that locale, e.g., just "western" digits for en_GB.
And fractions and superscripts are not digits.
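
As a rough illustration of that last point (again in Python, and only as an
analogy: this is Unicode's own classification, not what any POSIX locale
defines), superscripts and fractions are not in the decimal-digit category Nd.
The is_decimal_digit() helper is a name I made up for the sketch.

```python
import unicodedata


def is_decimal_digit(ch):
    """True only for characters in Unicode general category Nd."""
    return unicodedata.category(ch) == "Nd"


# Hypothetical sample characters, chosen for illustration only.
SAMPLES = {
    "DIGIT FIVE": "5",
    "DEVANAGARI DIGIT FIVE": "\u096B",
    "SUPERSCRIPT TWO": "\u00B2",
    "VULGAR FRACTION ONE HALF": "\u00BD",
}

# Map each sample name to whether the character is a decimal digit.
RESULTS = {name: is_decimal_digit(ch) for name, ch in SAMPLES.items()}
```

Here the two real digits come out as decimal digits and the superscript and
the fraction do not.  Python's looser str.isdigit() accepts the superscript
too, which is the kind of over-matching I would like [[:digit:]] to avoid.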

If you really want to match any digit in any language, you could add a
"Unicode" locale or perhaps region.
