Glenn, as context, see http://www.unicode.org/faq/collation.html and http://www.unicode.org/reports/tr10/ (Unicode Technical Standard #10: Unicode Collation Algorithm).
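For a quick look at how much a sort order depends on the locale's collation rules, something like the sketch below may help. It is only an illustration under assumptions: it needs a POSIX shell with the system `sort` and `tr`, the helper name `sorted_in` is made up here, and only the C locale is guaranteed to exist, so the UTF-8 locale name is left commented out as an example.

```shell
# sorted_in is a hypothetical helper: sort a fixed word list under the
# locale named by $1 and print the result on one line.
sorted_in() {
    printf '%s\n' b A a B | LC_ALL="$1" sort | tr '\n' ' '
}

# The C (POSIX) locale collates by byte value, so uppercase letters
# come before lowercase ones.
sorted_in C

# Another installed locale may order the same input differently.
# en_US.UTF-8 is only an example name and may not exist on a given system.
# sorted_in en_US.UTF-8
```

Comparing the two results on one machine shows whether its UTF-8 locale collates differently from plain byte order. It does not show whether that locale follows the Unicode Collation Algorithm.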
I think, because the libc locale modules do implement the collation sorting, the question is, do the UTF-8 locales implement the Unicode standard collation algorithm? This might be a question for Dr. Fink.

Olga

On Mon, Dec 3, 2012 at 6:25 PM, Cedric Blancher <[email protected]> wrote:
> On 3 December 2012 18:16, Glenn Fowler <[email protected]> wrote:
>>
>> On Mon, 3 Dec 2012 18:07:00 +0100 Cedric Blancher wrote:
>>> On 20 November 2012 16:27, Glenn Fowler <[email protected]> wrote:
>>> >
>>> > On Tue, 20 Nov 2012 10:04:36 +0100 Cedric Blancher wrote:
>>> >> On 17 November 2012 11:25, Roland Mainz <[email protected]> wrote:
>>> >> > On Fri, Nov 16, 2012 at 6:00 PM, Roland Mainz
>>> >> > <[email protected]> wrote:
>>> >> >> On Fri, Nov 16, 2012 at 5:57 PM, Roland Mainz
>>> >> >> <[email protected]> wrote:
>>> >> >>> The following testcase (which should basically test whether the
>>> >> >>> SystemV "tr" range expression [a-z] works with 'a' and 'z' replaced
>>> >> >>> with \u[20a0] and \u[20af] ...) ...
>>> >> >>> -- snip --
>>> >> >>> $ ~/bin/ksh -x -c $'builtin tr ; tr -c
>>> >> >>> $\'[:digit:][\u[20a0]-\u[20af]][:alpha:]\' "[\\n*]" <<<$\'hello
>>> >> >>> chicken \u[20ac] world\' ; true'
>>> >> >>> -- snip --
>>> >> >>> ... should AFAIK print something like this:
>>> >> >>> -- snip --
>>> >> >>> + builtin tr
>>> >> >>> + tr -c $'[:digit:][\u[20a0]-\u[20af]][:alpha:]' '[\n*]'
>>> >> >>> + 0<<< hello chicken € world
>>> >> >>> hello
>>> >> >>> chicken
>>> >> >>>
>>> >> >>>
>>> >> >>> world
>>> >> >>> + true
>>> >> >>>
>>> >> >>> -- snip --
>>> >> >>> ... but ast-ksh.2012-11-24 with Glenn's latest tr.c changes gives
>>> >> >>> this output:
>>> >> >>> -- snip --
>>> >> >>> + builtin tr
>>> >> >>> + tr -c $'[:digit:][\u[20a0]-\u[20af]][:alpha:]' '[\n*]'
>>> >> >>> + 0<<< hello chicken € world
>>> >> >>> hello
>>> >> >>> chicken
>>> >> >>> €
>>> >> >>> world
>>> >> >>> + true
>>> >> >>>
>>> >> >>> -- snip --
>>> >> >>>
>>> >> >>> Erm... does anyone spot the mistake ? Or is this a AST "tr" bug ?
>>> >> >>
>>> >> >> BTW: It seems to work if I remove the leading [:digit:] expression:
>>> >> >> -- snip --
>>> >> >> $ ~/bin/ksh -x -c $'builtin tr ; tr -c
>>> >> >> $\'[\u[20a0]-\u[20af]][:alpha:]\' "[\\n*]" <<<$\'hello chicken
>>> >> >> \u[20ac] world\' ; true'
>>> >> >> + builtin tr
>>> >> >> + tr -c $'[\u[20a0]-\u[20af]][:alpha:]' '[\n*]'
>>> >> >> + 0<<< hello chicken € world
>>> >> >> hello
>>> >> >> chicken
>>> >> >> €
>>> >> >> world
>>> >> >> + true
>>> >> >> -- snip --
>>> >> >
>>> >> > ... or if I put the [:digit:] at the end:
>>> >> > -- snip --
>>> >> > $ ~/bin/ksh -x -c $'builtin tr ; tr -c
>>> >> > $\'[\u[20a0]-\u[20af]][:alpha:][:digit:]\' "[\\n*]" <<<$\'hello
>>> >> > chicken 6a \u[20ac] world\' ; true'
>>> >> > + builtin tr
>>> >> > + tr -c $'[\u[20a0]-\u[20af]][:alpha:][:digit:]' '[\n*]'
>>> >> > + 0<<< hello chicken 6a € world
>>> >> > hello
>>> >> > chicken
>>> >> > 6a
>>> >> > €
>>> >> > world
>>> >> > + true
>>> >> > -- snip --
>>> >> >
>>> >> > ... erm... question for Glenn:
>>> >> > Must range patterns (e.g. [a-z] or 'a' and 'z' replaced by Unicode
>>> >> > characters) be sorted before character classes like [:digit:] or
>>> >> > [:alpha:] (this may be a case where a --strict option should
>>> >> > warn/complain if the arguments must be sorted) ?
>>> >
>>> >> The current implementation requires the argument to be sorted -
>>> >> characters first, then ranges and finally character classes
>>> >> ([:digit:]) - but I'm not seeing that the standard requires this.
>>> >> Glenn, can you elaborate on this?
>>> >
>>> > the current implementation of ast tr?
>>
>>> ./arch/linux.i386-64/bin/ksh -c 'builtin tr ; tr --version'
>>> version tr (AT&T Research) 2012-11-12
>>
>>> Rephrasing my question:
>>> 1. Does the standard, whatever it's name or version, require the tr
>>> arguments to be sorted like regex arguments need to be sorted?
>>> 2. Does the current AST tr implementation (tr (AT&T Research)
>>> 2012-11-12) require the arguments to be sorted?
>>
>> right, that clarifies "current implementation"
>>
>> can you point to the text in the standard that
>> "requires the argument to be sorted"
>>
>> ast tr does not require any specific ordering on the user's part
>> but note that for -C the user and tr implementation are constrained by
>> the collation order in the current locale whereby one command line
>> could produce different results for each locale with a differing
>> collation order
>>
>> I can't fathom reliable usage of -C in portable scripts
>
> Can you fathom reliable usage of tr -C when the locale is using UTF-8
> encoding and follows Unicode standard conventions, i.e. the Unicode
> standard collation order?
>
> Ced
> --
> Cedric Blancher <[email protected]>
> Institute Pasteur
> _______________________________________________
> ast-developers mailing list
> [email protected]
> http://lists.research.att.com/mailman/listinfo/ast-developers

--
      ,   _                                    _   ,
     { \/`o;====-    Olga Kryzhanovska   -====;o`\/ }
.----'-/`-/     [email protected]   \-`\-'----.
 `'-..-| /       http://twitter.com/fleyta     \ |-..-'`
      /\/\     Solaris/BSD//C/C++ programmer   /\/\
      `--`                                      `--`
