[Repost... seems the original email got somehow lost in a mailman server outage... ;-( ]
---------- Forwarded message ----------
From: Roland Mainz <[email protected]>
Date: Sat, Apr 6, 2013 at 3:10 AM
Subject: Re: [ast-users] Matching accented é with [=e=] using AST tr
To: Cedric Blancher <[email protected]>, Glenn Fowler <[email protected]>
Cc: [email protected], ast-users <[email protected]>

On Fri, Mar 15, 2013 at 3:57 PM, Cedric Blancher <[email protected]> wrote:
> On 14 March 2013 23:01, Roland Mainz <[email protected]> wrote:
>> On Thu, Mar 14, 2013 at 2:19 PM, Cedric Blancher
>> <[email protected]> wrote:
>>> How do I match accented e (i.e. é) using an equivalence class in AST tr?
>>>
>>> Doing that in sed is easy:
>>> ~/bin/sed -r "s/[[=e=]]/X/g" <<<"8é8" ; printf "\n"
>>> 8X8
>>>
>>> But in tr I am not able to get it working:
>>> ksh -c 'builtin tr ; tr -Cd "[=e=]" <<<"1e2é3" ; print'
>>> e
>>>
>>> AFAIK this should print "eé".
>>>
>>> I used:
>>> version tr (AT&T Research) 2012-11-12
>>> version sed (AT&T Research) 2012-03-28
>>
>> Erm... without digging around... does AST "tr" support the POSIX
>> equivalence class syntax yet (Glenn... ping!) ? My first guess would
>> be to try another platform like Solaris to see if the issue is
>> libc-related...
>
> Glenn, does AST tr support the [=e=] syntax?
[snip]

Technically there is code in src/lib/libcmd/tr.c to support [=e=] ...
-- snip --
		case '.':
		case '=':
			if ((q = regcollate((char*)tr->next, (char**)&e, buf, sizeof(buf), &wc)) >= 0)
			{
				tr->next = e;
				c = q ? buf[0] : 0;
				break;
			}
			/*FALLTHROUGH*/
		member:
			if (*(e = tr->next + 1))
			{
				while (*++e && *e != c && *e != ']');
				if (*e != ']' && *++e == ']')
					return -2;
			}
-- snip --
... but it doesn't seem to work...
;-(

The following testcase prints the differences between "tr" and "sed"
for a given "tr"-like pattern:
-- snip --
set -o nounset
IFS=''

typeset -li16 i
typeset sc # plain character to test
typeset sq # character "sc" quoted and wrapped in '='
typeset s1 s2 # tests

builtin tr

typeset -T pat_t=(
	typeset lc_all
	typeset pattern
)

integer p
pat_t -a patlist=(
	(
		lc_all='en_US.UTF-8'
		pattern='[=e=]'
	)
)

for (( p=0 ; p < ${#patlist[@]} ; p++ )) ; do
	nameref pat=patlist[p]
	(
		export LC_ALL="${pat.lc_all}"
		for (( i=0x30 ; i< 0x2000 ; i++ )) ; do
			sc="$(printf "\u[${i#16#}]\n" 2>'/dev/null')"
			# no pipe here to avoid the costs for |fork()|
			sq="$(printf "=%s=" "$sc")"
			s1="$(tr -d "${pat.pattern}" <<<"$sq")"
			s2="$(sed "s/[${pat.pattern}]//g" <<<"$sq")"
			[[ "$s1" != "$s2" ]] && printf "%q/%q: %5.5x ch=%s tr=%s sed=%s\n" "${pat.lc_all}" "${pat.pattern}" i "$sc" "$s1" "$s2"
		done
	)
done
-- snip --

With ast-ksh.2013-04-02 the output looks like this (on SuSE
12.2/AMD64/64bit):
-- snip --
$ ~/bin/ksh /tmp/tr_test17.sh
en_US.UTF-8/'[=e=]': 00045 ch=E tr==E= sed===
en_US.UTF-8/'[=e=]': 000c8 ch=È tr==È= sed===
en_US.UTF-8/'[=e=]': 000c9 ch=É tr==É= sed===
en_US.UTF-8/'[=e=]': 000ca ch=Ê tr==Ê= sed===
en_US.UTF-8/'[=e=]': 000cb ch=Ë tr==Ë= sed===
en_US.UTF-8/'[=e=]': 000e8 ch=è tr==è= sed===
en_US.UTF-8/'[=e=]': 000e9 ch=é tr==é= sed===
en_US.UTF-8/'[=e=]': 000ea ch=ê tr==ê= sed===
en_US.UTF-8/'[=e=]': 000eb ch=ë tr==ë= sed===
en_US.UTF-8/'[=e=]': 00112 ch=Ē tr==Ē= sed===
en_US.UTF-8/'[=e=]': 00113 ch=ē tr==ē= sed===
en_US.UTF-8/'[=e=]': 00114 ch=Ĕ tr==Ĕ= sed===
en_US.UTF-8/'[=e=]': 00115 ch=ĕ tr==ĕ= sed===
en_US.UTF-8/'[=e=]': 00116 ch=Ė tr==Ė= sed===
en_US.UTF-8/'[=e=]': 00117 ch=ė tr==ė= sed===
en_US.UTF-8/'[=e=]': 00118 ch=Ę tr==Ę= sed===
en_US.UTF-8/'[=e=]': 00119 ch=ę tr==ę= sed===
en_US.UTF-8/'[=e=]': 0011a ch=Ě tr==Ě= sed===
en_US.UTF-8/'[=e=]': 0011b ch=ě tr==ě= sed===
en_US.UTF-8/'[=e=]': 0018e ch=Ǝ tr==Ǝ= sed===
en_US.UTF-8/'[=e=]': 0018f ch=Ə tr==Ə= sed===
en_US.UTF-8/'[=e=]': 00190 ch=Ɛ tr==Ɛ= sed===
en_US.UTF-8/'[=e=]': 001dd ch=ǝ tr==ǝ= sed===
en_US.UTF-8/'[=e=]': 00204 ch=Ȅ tr==Ȅ= sed===
en_US.UTF-8/'[=e=]': 00205 ch=ȅ tr==ȅ= sed===
en_US.UTF-8/'[=e=]': 00206 ch=Ȇ tr==Ȇ= sed===
en_US.UTF-8/'[=e=]': 00207 ch=ȇ tr==ȇ= sed===
en_US.UTF-8/'[=e=]': 00228 ch=Ȩ tr==Ȩ= sed===
en_US.UTF-8/'[=e=]': 00229 ch=ȩ tr==ȩ= sed===
en_US.UTF-8/'[=e=]': 00259 ch=ə tr==ə= sed===
en_US.UTF-8/'[=e=]': 0025b ch=ɛ tr==ɛ= sed===
en_US.UTF-8/'[=e=]': 01e14 ch=Ḕ tr==Ḕ= sed===
en_US.UTF-8/'[=e=]': 01e15 ch=ḕ tr==ḕ= sed===
en_US.UTF-8/'[=e=]': 01e16 ch=Ḗ tr==Ḗ= sed===
en_US.UTF-8/'[=e=]': 01e17 ch=ḗ tr==ḗ= sed===
en_US.UTF-8/'[=e=]': 01e18 ch=Ḙ tr==Ḙ= sed===
en_US.UTF-8/'[=e=]': 01e19 ch=ḙ tr==ḙ= sed===
en_US.UTF-8/'[=e=]': 01e1a ch=Ḛ tr==Ḛ= sed===
en_US.UTF-8/'[=e=]': 01e1b ch=ḛ tr==ḛ= sed===
en_US.UTF-8/'[=e=]': 01e1c ch=Ḝ tr==Ḝ= sed===
en_US.UTF-8/'[=e=]': 01e1d ch=ḝ tr==ḝ= sed===
en_US.UTF-8/'[=e=]': 01eb8 ch=Ẹ tr==Ẹ= sed===
en_US.UTF-8/'[=e=]': 01eb9 ch=ẹ tr==ẹ= sed===
en_US.UTF-8/'[=e=]': 01eba ch=Ẻ tr==Ẻ= sed===
en_US.UTF-8/'[=e=]': 01ebb ch=ẻ tr==ẻ= sed===
en_US.UTF-8/'[=e=]': 01ebc ch=Ẽ tr==Ẽ= sed===
en_US.UTF-8/'[=e=]': 01ebd ch=ẽ tr==ẽ= sed===
en_US.UTF-8/'[=e=]': 01ebe ch=Ế tr==Ế= sed===
en_US.UTF-8/'[=e=]': 01ebf ch=ế tr==ế= sed===
en_US.UTF-8/'[=e=]': 01ec0 ch=Ề tr==Ề= sed===
en_US.UTF-8/'[=e=]': 01ec1 ch=ề tr==ề= sed===
en_US.UTF-8/'[=e=]': 01ec2 ch=Ể tr==Ể= sed===
en_US.UTF-8/'[=e=]': 01ec3 ch=ể tr==ể= sed===
en_US.UTF-8/'[=e=]': 01ec4 ch=Ễ tr==Ễ= sed===
en_US.UTF-8/'[=e=]': 01ec5 ch=ễ tr==ễ= sed===
en_US.UTF-8/'[=e=]': 01ec6 ch=Ệ tr==Ệ= sed===
en_US.UTF-8/'[=e=]': 01ec7 ch=ệ tr==ệ= sed===
-- snip --

AFAIK the test script should print nothing if "sed" and "tr" matched
exactly the same characters on a per-character basis...

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
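For anyone who wants to spot-check a single character without running the full 0x30..0x2000 sweep, here is a minimal sketch of the same comparison in plain POSIX sh. The helper name tr_vs_sed is hypothetical and not part of the original test script. It uses whatever "tr" and "sed" come first in $PATH rather than the ksh builtin, so the results depend on the implementation and the active locale.

```shell
#!/bin/sh
# Hypothetical helper, not from the original test script.
# Usage: tr_vs_sed CLASS CHAR
# Wraps CHAR in '=' signs (as the original script does), deletes everything
# matching CLASS with both tools, and prints whether the results agree.
tr_vs_sed() {
    class=$1
    ch=$2
    input="=${ch}="
    # tr takes the equivalence class as a bare set operand ...
    tr_out=$(printf '%s\n' "$input" | tr -d "$class")
    # ... while sed needs it wrapped in a bracket expression.
    sed_out=$(printf '%s\n' "$input" | sed "s/[${class}]//g")
    if [ "$tr_out" = "$sed_out" ]; then
        printf 'same ch=%s tr=%s sed=%s\n' "$ch" "$tr_out" "$sed_out"
    else
        printf 'DIFF ch=%s tr=%s sed=%s\n' "$ch" "$tr_out" "$sed_out"
    fi
}

# Spot-check a few characters in the current locale.
for ch in e E é x; do
    tr_vs_sed '[=e=]' "$ch"
done
```

To check the builtin instead, run the same function under ksh after "builtin tr" and with LC_ALL exported to the locale under test. A "DIFF" line then corresponds to one line of the output shown above.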
