Re: [ast-developers] Using multibyte characters as "tr" command arguments (e.g. $ tr ... $'[\u[20a0]-\u[20af]]' #) ...

Glenn Fowler Mon, 19 Nov 2012 12:44:04 -0800

I think you are testing too many things at once
do the basics first and then compose


assume a UTF-8 locale
$'\u[20ac]' is the unicode euro character
use standard range expressions a-z, not sysV [a-z]
(this way you can test other standard tr implementations)
map to [X*] instead of [\n*] to be easy on the eyes
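
for reference, a small sketch of why these characters count as multibyte in a
UTF-8 locale. the helper name utf8_hex is only for illustration and is not part
of tr or ksh; it just shows the bytes a byte-oriented tr would see.

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as a hex string."""
    return ch.encode("utf-8").hex()

# U+20AC (euro) is three bytes in UTF-8
print(utf8_hex("\u20ac"))

# every code point in the range U+20A0..U+20AF is also three bytes long
print(all(len(chr(c).encode("utf-8")) == 3 for c in range(0x20A0, 0x20B0)))
```

so a tr that works byte by byte cannot treat \u[20a0]-\u[20af] as a range of 16
characters.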

first, -c in C or mb locales uses the code point sorting order
the way that is implemented is
(1) set1 is parsed completely setting a bit for each selected code point
(2) the -c complement sets up a new table indexed by code point
        wchar_t ordered_set1[max_code_point];
        for (c = n = 0; c < max_code_point; c++)
                if (!in_set_1(c))
                        ordered_set1[n++] = c;
(3) if -C were specified instead then ordered_set1[] would be sorted
    according to the LC_COLLATE locale setting
(4) ordered_set1[] is then used to map 1-1 into set2[] which
    is ordered left-to-right, i.e., the l-r order specified on the command line
(5) this means that for -c and -C the specification order for set1 does not
    matter
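
the steps above can be sketched roughly as follows. this is a simulation under
stated assumptions, not the actual tr source: the function names, the dict used
as the translation table, and the small MAX_CODE_POINT are all made up for the
example, and step (3), the LC_COLLATE sort for -C, is left out.

```python
def expand_range(first, last):
    """Expand a range such as a-z into its characters, in code point order."""
    return [chr(c) for c in range(ord(first), ord(last) + 1)]

def build_complement_table(set1, set2, fill, max_code_point):
    """Model tr -c: map every code point NOT in set1 onto set2.

    set2 is padded with `fill`, which is how a trailing [X*] is modeled here.
    """
    # (1) mark each code point selected by set1
    selected = {ord(ch) for ch in set1}
    # (2) the complement, in code point order
    ordered_set1 = [c for c in range(max_code_point) if c not in selected]
    # (4) map 1-1 into set2, left to right, padding with the fill character
    table = {}
    for i, c in enumerate(ordered_set1):
        table[c] = set2[i] if i < len(set2) else fill
    return table

def translate(text, table):
    """Apply the table; characters without an entry pass through unchanged."""
    return "".join(table.get(ord(ch), ch) for ch in text)

# assumption: kept small so the demo is cheap; a real table covers all code points
MAX_CODE_POINT = 0x2100

# models: tr -c $'\u[20a0]-\u[20af]\n' '[X*]'
set1 = expand_range("\u20a0", "\u20af") + ["\n"]
table = build_complement_table(set1, [], "X", MAX_CODE_POINT)

# 'a' is outside set1 and becomes X; the euro and the newline are kept
print(translate("a\u20ac\n", table), end="")

# (5) the order in which set1 is specified does not change the table
print(build_complement_table(list(reversed(set1)), [], "X", MAX_CODE_POINT) == table)
```

because the complement is rebuilt by walking the code points in order, only
membership in set1 matters, which is the point of (5).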

to avoid output with no trailing newline, \n is always added to set1 for -c/-C
here's a start for some tests in regress(1) form
copy to tr.tst and run
        regress tr.tst
or to test other tr implementations
        regress tr.tst /usr/xpg1234/bin/tr
now when the discussion ends we'll have a regression test to add to the packages
--
UNIT tr

TEST 01 'multibyte exercises'

        EXPORT  LC_CTYPE=en_US.UTF-8

        EXEC    $'\u[20ac]' '[X*]'
                INPUT - $'\u[20ac]'
                OUTPUT - $'X'
        EXEC    $'\u[20a0]-\u[20af]' '[X*]'

        EXEC    -c $'\u[20ac]\n' '[X*]'
                OUTPUT - $'\u[20ac]'
        EXEC    -c $'\u[20a0]-\u[20af]\n' '[X*]'
--
