this now works

        LC_ALL=en_US.UTF-8 tr '\u[20a0]-\u[20af]' '[\n*]' <<< $'hello\u[20a1]world'
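
For the ordering question quoted below, a minimal sketch of an order-independence check. It assumes a POSIX-style tr that accepts the '[\n*]' repeat form (GNU tr does; this is not specific to ast tr) and uses ASCII-only input in the C locale, so collation and multibyte handling play no part. The input string and variable names are illustrative, not taken from the thread.

```shell
# Illustrative check: ASCII-only input, C locale, POSIX-style tr assumed.
input='hello chicken 6a world'

# Same complemented set, with the character classes in two different orders.
out_a=$(printf '%s\n' "$input" | LC_ALL=C tr -c '[:digit:][:alpha:]' '[\n*]')
out_b=$(printf '%s\n' "$input" | LC_ALL=C tr -c '[:alpha:][:digit:]' '[\n*]')

# If the order of operands inside the set is irrelevant, both runs agree.
if [ "$out_a" = "$out_b" ]; then
    echo "same result for both orderings"
else
    echo "orderings differ"
fi
```

If the two results differ on a given tr, that would point at the kind of ordering sensitivity asked about in the quoted testcase.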

On Mon, 3 Dec 2012 18:25:32 +0100 Cedric Blancher wrote:
> On 3 December 2012 18:16, Glenn Fowler <[email protected]> wrote:
> >
> > On Mon, 3 Dec 2012 18:07:00 +0100 Cedric Blancher wrote:
> >> On 20 November 2012 16:27, Glenn Fowler <[email protected]> wrote:
> >> >
> >> > On Tue, 20 Nov 2012 10:04:36 +0100 Cedric Blancher wrote:
> >> >> On 17 November 2012 11:25, Roland Mainz <[email protected]> wrote:
> >> >> > On Fri, Nov 16, 2012 at 6:00 PM, Roland Mainz <[email protected]> wrote:
> >> >> >> On Fri, Nov 16, 2012 at 5:57 PM, Roland Mainz <[email protected]> wrote:
> >> >> >>> The following testcase (which should basically test whether the
> >> >> >>> SystemV "tr" range expression [a-z] works with 'a' and 'z' replaced
> >> >> >>> with \u[20a0] and \u[20af] ...) ...
> >> >> >>> -- snip --
> >> >> >>> $ ~/bin/ksh -x -c $'builtin tr ; tr -c
> >> >> >>> $\'[:digit:][\u[20a0]-\u[20af]][:alpha:]\' "[\\n*]" <<<$\'hello
> >> >> >>> chicken \u[20ac] world\' ; true'
> >> >> >>> -- snip --
> >> >> >>> ... should AFAIK print something like this:
> >> >> >>> -- snip --
> >> >> >>> + builtin tr
> >> >> >>> + tr -c $'[:digit:][\u[20a0]-\u[20af]][:alpha:]' '[\n*]'
> >> >> >>> + 0<<< hello chicken € world
> >> >> >>> hello
> >> >> >>> chicken
> >> >> >>>
> >> >> >>>
> >> >> >>> world
> >> >> >>> + true
> >> >> >>>
> >> >> >>> -- snip --
> >> >> >>> ... but ast-ksh.2012-11-24 with Glenn's latest tr.c changes gives this output:
> >> >> >>> -- snip --
> >> >> >>> + builtin tr
> >> >> >>> + tr -c $'[:digit:][\u[20a0]-\u[20af]][:alpha:]' '[\n*]'
> >> >> >>> + 0<<< hello chicken € world
> >> >> >>> hello
> >> >> >>> chicken
> >> >> >>> €
> >> >> >>> world
> >> >> >>> + true
> >> >> >>>
> >> >> >>> -- snip --
> >> >> >>>
> >> >> >>> Erm... does anyone spot the mistake? Or is this an AST "tr" bug?
> >> >> >>
> >> >> >> BTW: It seems to work if I remove the leading [:digit:] expression:
> >> >> >> -- snip --
> >> >> >> $ ~/bin/ksh -x -c $'builtin tr ; tr -c
> >> >> >> $\'[\u[20a0]-\u[20af]][:alpha:]\' "[\\n*]" <<<$\'hello chicken
> >> >> >> \u[20ac] world\' ; true'
> >> >> >> + builtin tr
> >> >> >> + tr -c $'[\u[20a0]-\u[20af]][:alpha:]' '[\n*]'
> >> >> >> + 0<<< hello chicken € world
> >> >> >> hello
> >> >> >> chicken
> >> >> >> €
> >> >> >> world
> >> >> >> + true
> >> >> >> -- snip --
> >> >> >
> >> >> > ... or if I put the [:digit:] at the end:
> >> >> > -- snip --
> >> >> > $ ~/bin/ksh -x -c $'builtin tr ; tr -c
> >> >> > $\'[\u[20a0]-\u[20af]][:alpha:][:digit:]\' "[\\n*]" <<<$\'hello
> >> >> > chicken 6a \u[20ac] world\' ; true'
> >> >> > + builtin tr
> >> >> > + tr -c $'[\u[20a0]-\u[20af]][:alpha:][:digit:]' '[\n*]'
> >> >> > + 0<<< hello chicken 6a € world
> >> >> > hello
> >> >> > chicken
> >> >> > 6a
> >> >> > €
> >> >> > world
> >> >> > + true
> >> >> > -- snip --
> >> >> >
> >> >> > ... erm... question for Glenn:
> >> >> > Must range patterns (e.g. [a-z] or 'a' and 'z' replaced by Unicode
> >> >> > characters) be sorted before character classes like [:digit:] or
> >> >> > [:alpha:] (this may be a case where a --strict option should
> >> >> > warn/complain if the arguments must be sorted) ?
> >> >
> >> >> The current implementation requires the argument to be sorted -
> >> >> characters first, then ranges and finally character classes
> >> >> ([:digit:]) - but I'm not seeing that the standard requires this.
> >> >> Glenn, can you elaborate on this?
> >> >
> >> > the current implementation of ast tr?
> >
> >>  ./arch/linux.i386-64/bin/ksh -c 'builtin tr ; tr --version'
> >>   version         tr (AT&T Research) 2012-11-12
> >
> >> Rephrasing my question:
> >> 1. Does the standard, whatever its name or version, require the tr
> >> arguments to be sorted like regex arguments need to be sorted?
> >> 2. Does the current AST tr implementation (tr (AT&T Research)
> >> 2012-11-12) require the arguments to be sorted?
> >
> > right, that clarifies "current implementation"
> >
> > can you point to the text in the standard that
> > "requires the argument to be sorted"
> >
> > ast tr does not require any specific ordering on the user's part
> > but note that for -C the user and tr implementation are constrained by
> > the collation order in the current locale whereby one command line
> > could produce different results for each locale with a differing
> > collation order
> >
> > I can't fathom reliable usage of -C in portable scripts

> Can you fathom reliable usage of tr -C when the locale is using UTF-8
> encoding and follows Unicode standard conventions, i.e. the Unicode
> standard collation order?

> Ced
> -- 
> Cedric Blancher <[email protected]>
> Institute Pasteur

_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers