Glenn, for context, see http://www.unicode.org/faq/collation.html and
http://www.unicode.org/reports/tr10/ (Unicode Technical Standard #10:
Unicode Collation Algorithm).

Since the libc locale modules already implement collation, I think the
real question is whether the UTF-8 locales implement the Unicode
Collation Algorithm.
That might be a question for Dr. Fink.
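
One rough way to probe this from the shell, without reading the locale
sources, is to sort a handful of characters under two locales and
compare the results. This is only a sketch: collate_order is a made-up
helper name, and it assumes a POSIX sort(1) that takes its collation
rules from LC_ALL. A differing order only shows that the locale has its
own collation rules, not that it implements UTS #10.

```shell
# collate_order LOCALE ITEM...
# Hypothetical helper: print the items, one per line, sorted by the
# collation rules of LOCALE (as applied by the system's sort utility).
collate_order() {
    loc=$1
    shift
    # One item per line, then sort with only the locale overridden.
    printf '%s\n' "$@" | LC_ALL="$loc" sort
}

# Baseline: the C locale always exists and sorts by byte value.
collate_order C b a B A
```

In the C locale this prints A, B, a, b. Running the same call with a
UTF-8 locale that is installed on the system (en_US.UTF-8, for example)
shows whether that locale orders the same characters differently, and
feeding it the \u[20a0]..\u[20af] characters from Roland's testcase
would show where the range endpoints land in that locale.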

Olga

On Mon, Dec 3, 2012 at 6:25 PM, Cedric Blancher
<[email protected]> wrote:
> On 3 December 2012 18:16, Glenn Fowler <[email protected]> wrote:
>>
>> On Mon, 3 Dec 2012 18:07:00 +0100 Cedric Blancher wrote:
>>> On 20 November 2012 16:27, Glenn Fowler <[email protected]> wrote:
>>> >
>>> > On Tue, 20 Nov 2012 10:04:36 +0100 Cedric Blancher wrote:
>>> >> On 17 November 2012 11:25, Roland Mainz <[email protected]> wrote:
>>> >> > On Fri, Nov 16, 2012 at 6:00 PM, Roland Mainz 
>>> >> > <[email protected]> wrote:
>>> >> >> On Fri, Nov 16, 2012 at 5:57 PM, Roland Mainz 
>>> >> >> <[email protected]> wrote:
>>> >> >>> The following testcase (which should basically test whether the
>>> >> >>> SystemV "tr" range expression [a-z] works with 'a' and 'z' replaced
>>> >> >>> with \u[20a0] and \u[20af] ...) ...
>>> >> >>> -- snip --
>>> >> >>> $ ~/bin/ksh -x -c $'builtin tr ; tr -c
>>> >> >>> $\'[:digit:][\u[20a0]-\u[20af]][:alpha:]\' "[\\n*]" <<<$\'hello
>>> >> >>> chicken \u[20ac] world\' ; true'
>>> >> >>> -- snip --
>>> >> >>> ... should AFAIK print something like this:
>>> >> >>> -- snip --
>>> >> >>> + builtin tr
>>> >> >>> + tr -c $'[:digit:][\u[20a0]-\u[20af]][:alpha:]' '[\n*]'
>>> >> >>> + 0<<< hello chicken € world
>>> >> >>> hello
>>> >> >>> chicken
>>> >> >>>
>>> >> >>>
>>> >> >>> world
>>> >> >>> + true
>>> >> >>>
>>> >> >>> -- snip --
>>> >> >>> ... but ast-ksh.2012-11-24 with Glenn's latest tr.c changes gives 
>>> >> >>> this output:
>>> >> >>> -- snip --
>>> >> >>> + builtin tr
>>> >> >>> + tr -c $'[:digit:][\u[20a0]-\u[20af]][:alpha:]' '[\n*]'
>>> >> >>> + 0<<< hello chicken € world
>>> >> >>> hello
>>> >> >>> chicken
>>> >> >>> €
>>> >> >>> world
>>> >> >>> + true
>>> >> >>>
>>> >> >>> -- snip --
>>> >> >>>
>>> >> >>> Erm... does anyone spot the mistake? Or is this an AST "tr" bug?
>>> >> >>
>>> >> >> BTW: It seems to work if I remove the leading [:digit:] expression:
>>> >> >> -- snip --
>>> >> >> $ ~/bin/ksh -x -c $'builtin tr ; tr -c
>>> >> >> $\'[\u[20a0]-\u[20af]][:alpha:]\' "[\\n*]" <<<$\'hello chicken
>>> >> >> \u[20ac] world\' ; true'
>>> >> >> + builtin tr
>>> >> >> + tr -c $'[\u[20a0]-\u[20af]][:alpha:]' '[\n*]'
>>> >> >> + 0<<< hello chicken € world
>>> >> >> hello
>>> >> >> chicken
>>> >> >> €
>>> >> >> world
>>> >> >> + true
>>> >> >> -- snip --
>>> >> >
>>> >> > ... or if I put the [:digit:] at the end:
>>> >> > -- snip --
>>> >> > $ ~/bin/ksh -x -c $'builtin tr ; tr -c
>>> >> > $\'[\u[20a0]-\u[20af]][:alpha:][:digit:]\' "[\\n*]" <<<$\'hello
>>> >> > chicken 6a \u[20ac] world\' ; true'
>>> >> > + builtin tr
>>> >> > + tr -c $'[\u[20a0]-\u[20af]][:alpha:][:digit:]' '[\n*]'
>>> >> > + 0<<< hello chicken 6a € world
>>> >> > hello
>>> >> > chicken
>>> >> > 6a
>>> >> > €
>>> >> > world
>>> >> > + true
>>> >> > -- snip --
>>> >> >
>>> >> > ... erm... question for Glenn:
>>> >> > Must range patterns (e.g. [a-z] or 'a' and 'z' replaced by Unicode
>>> >> > characters) be sorted before character classes like [:digit:] or
>>> >> > [:alpha:] (this may be a case where a --strict option should
>>> >> > warn/complain if the arguments must be sorted) ?
>>> >
>>> >> The current implementation requires the argument to be sorted -
>>> >> characters first, then ranges and finally character classes
>>> >> ([:digit:]) - but I'm not seeing that the standard requires this.
>>> >> Glenn, can you elaborate on this?
>>> >
>>> > the current implementation of ast tr?
>>
>>>  ./arch/linux.i386-64/bin/ksh -c 'builtin tr ; tr --version'
>>>   version         tr (AT&T Research) 2012-11-12
>>
>>> Rephrasing my question:
>>> 1. Does the standard, whatever its name or version, require the tr
>>> arguments to be sorted like regex arguments need to be sorted?
>>> 2. Does the current AST tr implementation (tr (AT&T Research)
>>> 2012-11-12) require the arguments to be sorted?
>>
>> right, that clarifies "current implementation"
>>
>> can you point to the text in the standard that
>> "requires the argument to be sorted"
>>
>> ast tr does not require any specific ordering on the user's part
>> but note that for -C the user and tr implementation are constrained by
>> the collation order in the current locale whereby one command line
>> could produce different results for each locale with a differing
>> collation order
>>
>> I can't fathom reliable usage of -C in portable scripts
>
> Can you fathom reliable usage of tr -C when the locale is using UTF-8
> encoding and follows Unicode standard conventions, i.e. the Unicode
> standard collation order?
>
> Ced
> --
> Cedric Blancher <[email protected]>
> Institute Pasteur
> _______________________________________________
> ast-developers mailing list
> [email protected]
> http://lists.research.att.com/mailman/listinfo/ast-developers



-- 
      ,   _                                    _   ,
     { \/`o;====-    Olga Kryzhanovska   -====;o`\/ }
.----'-/`-/     [email protected]   \-`\-'----.
 `'-..-| /       http://twitter.com/fleyta     \ |-..-'`
      /\/\     Solaris/BSD//C/C++ programmer   /\/\
      `--`                                      `--`