the code used by sed is embedded in src/lib/libast/regex/regcomp.c
the tr code is incomplete and should use something very close to the regcomp code
it will take a few alphas to pull out the common code and make a good api

all of this rigmarole because there is no standard way to access
the collating sequence for a given locale

On Sat, 6 Apr 2013 19:31:25 +0200 Cedric Blancher wrote:
> On 6 April 2013 03:45, Roland Mainz <[email protected]> wrote:
> > [Repost... seems the original email got somehow lost in a mailman
> > server outage... ;-( ]
> >
> > ---------- Forwarded message ----------
> > From: Roland Mainz <[email protected]>
> > Date: Sat, Apr 6, 2013 at 3:10 AM
> > Subject: Re: [ast-users] Matching accented é with [=e=] using AST tr
> > To: Cedric Blancher <[email protected]>, Glenn Fowler
> > <[email protected]>
> > Cc: [email protected], ast-users <[email protected]>
> >
> >
> > On Fri, Mar 15, 2013 at 3:57 PM, Cedric Blancher
> > <[email protected]> wrote:
> >> On 14 March 2013 23:01, Roland Mainz <[email protected]> wrote:
> >>> On Thu, Mar 14, 2013 at 2:19 PM, Cedric Blancher
> >>> <[email protected]> wrote:
> >>>> How do I match accented e (i.e. é) using an equivalence class in AST tr?
> >>>>
> >>>> Doing that in sed is easy:
> >>>> ~/bin/sed -r "s/[[=e=]]/X/g" <<<"8é8" ; printf "\n"
> >>>> 8X8
> >>>>
> >>>> But in tr I am not able to get it working:
> >>>> ksh -c 'builtin tr ; tr -Cd "[=e=]" <<<"1e2é3" ; print'
> >>>> e
> >>>>
> >>>> AFAIK this should print "eé".
> >>>>
> >>>> I used:
> >>>> version tr (AT&T Research) 2012-11-12
> >>>> version sed (AT&T Research) 2012-03-28
> >>>
> >>> Erm... without digging around... does AST "tr" support the POSIX
> >>> equivalence class syntax yet (Glenn... ping!) ? My first guess would
> >>> be to try another platform like Solaris to see if the issue is
> >>> libc-related...
> >>
> >> Glenn, does AST tr support the [=e=] syntax?
> > [snip]
> >
> > Technically there is code in src/lib/libcmd/tr.c to support [=e=] ...
> > -- snip --
> >         case '.':
> >         case '=':
> >                 if ((q = regcollate((char*)tr->next, (char**)&e, buf, sizeof(buf), &wc)) >= 0)
> >                 {
> >                         tr->next = e;
> >                         c = q ? buf[0] : 0;
> >                         break;
> >                 }
> >                 /*FALLTHROUGH*/
> >         member:
> >                 if (*(e = tr->next + 1))
> >                 {
> >                         while (*++e && *e != c && *e != ']');
> >                         if (*e != ']' && *++e == ']')
> >                                 return -2;
> >                 }
> > -- snip --
> > ... but it doesn't seem to work... ;-(
> >
> > The following testcase prints the differences between "tr" and "sed"
> > for a given "tr"-like pattern:
> > -- snip --
> > set -o nounset
> > IFS=''
> >
> > typeset -li16 i
> > typeset sc # plain character to test
> > typeset sq # character "sc" quoted and wrapped in '='
> > typeset s1 s2 # tests
> >
> > builtin tr
> >
> > typeset -T pat_t=(
> >     typeset lc_all
> >     typeset pattern
> > )
> >
> > integer p
> > pat_t -a patlist=(
> >     ( lc_all='en_US.UTF-8' pattern='[=e=]' )
> > )
> >
> > for (( p=0 ; p < ${#patlist[@]} ; p++ )) ; do
> >     nameref pat=patlist[p]
> >     (
> >         export LC_ALL="${pat.lc_all}"
> >         for (( i=0x30 ; i< 0x2000 ; i++ )) ; do
> >             sc="$(printf "\u[${i#16#}]\n" 2>'/dev/null')"
> >
> >             # no pipe here to avoid the costs for |fork()|
> >             sq="$(printf "=%s=" "$sc")"
> >
> >             s1="$(tr -d "${pat.pattern}" <<<"$sq")"
> >             s2="$(sed "s/[${pat.pattern}]//g" <<<"$sq")"
> >             [[ "$s1" != "$s2" ]] && printf "%q/%q: %5.5x ch=%s tr=%s sed=%s\n" \
> >                 "${pat.lc_all}" "${pat.pattern}" i "$sc" "$s1" "$s2"
> >         done
> >     )
> > done
> > -- snip --
> >
> > With ast-ksh.2013-04-02 the output looks like this (on SuSE
> > 12.2/AMD64/64bit):
> > -- snip --
> > $ ~/bin/ksh /tmp/tr_test17.sh
> > en_US.UTF-8/'[=e=]': 00045 ch=E tr==E= sed===
> > en_US.UTF-8/'[=e=]': 000c8 ch=È tr==È= sed===
> > en_US.UTF-8/'[=e=]': 000c9 ch=É tr==É= sed===
> > en_US.UTF-8/'[=e=]': 000ca ch=Ê tr==Ê= sed===
> > en_US.UTF-8/'[=e=]': 000cb ch=Ë tr==Ë= sed===
> > en_US.UTF-8/'[=e=]': 000e8 ch=è tr==è= sed===
> > en_US.UTF-8/'[=e=]': 000e9 ch=é tr==é= sed===
> > en_US.UTF-8/'[=e=]': 000ea ch=ê tr==ê= sed===
> > en_US.UTF-8/'[=e=]': 000eb ch=ë tr==ë= sed===
> > en_US.UTF-8/'[=e=]': 00112 ch=Ē tr==Ē= sed===
> > en_US.UTF-8/'[=e=]': 00113 ch=ē tr==ē= sed===
> > en_US.UTF-8/'[=e=]': 00114 ch=Ĕ tr==Ĕ= sed===
> > en_US.UTF-8/'[=e=]': 00115 ch=ĕ tr==ĕ= sed===
> > en_US.UTF-8/'[=e=]': 00116 ch=Ė tr==Ė= sed===
> > en_US.UTF-8/'[=e=]': 00117 ch=ė tr==ė= sed===
> > en_US.UTF-8/'[=e=]': 00118 ch=Ę tr==Ę= sed===
> > en_US.UTF-8/'[=e=]': 00119 ch=ę tr==ę= sed===
> > en_US.UTF-8/'[=e=]': 0011a ch=Ě tr==Ě= sed===
> > en_US.UTF-8/'[=e=]': 0011b ch=ě tr==ě= sed===
> > en_US.UTF-8/'[=e=]': 0018e ch=Ǝ tr==Ǝ= sed===
> > en_US.UTF-8/'[=e=]': 0018f ch=Ə tr==Ə= sed===
> > en_US.UTF-8/'[=e=]': 00190 ch=Ɛ tr==Ɛ= sed===
> > en_US.UTF-8/'[=e=]': 001dd ch=ǝ tr==ǝ= sed===
> > en_US.UTF-8/'[=e=]': 00204 ch=Ȅ tr==Ȅ= sed===
> > en_US.UTF-8/'[=e=]': 00205 ch=ȅ tr==ȅ= sed===
> > en_US.UTF-8/'[=e=]': 00206 ch=Ȇ tr==Ȇ= sed===
> > en_US.UTF-8/'[=e=]': 00207 ch=ȇ tr==ȇ= sed===
> > en_US.UTF-8/'[=e=]': 00228 ch=Ȩ tr==Ȩ= sed===
> > en_US.UTF-8/'[=e=]': 00229 ch=ȩ tr==ȩ= sed===
> > en_US.UTF-8/'[=e=]': 00259 ch=ə tr==ə= sed===
> > en_US.UTF-8/'[=e=]': 0025b ch=ɛ tr==ɛ= sed===
> > en_US.UTF-8/'[=e=]': 01e14 ch=Ḕ tr==Ḕ= sed===
> > en_US.UTF-8/'[=e=]': 01e15 ch=ḕ tr==ḕ= sed===
> > en_US.UTF-8/'[=e=]': 01e16 ch=Ḗ tr==Ḗ= sed===
> > en_US.UTF-8/'[=e=]': 01e17 ch=ḗ tr==ḗ= sed===
> > en_US.UTF-8/'[=e=]': 01e18 ch=Ḙ tr==Ḙ= sed===
> > en_US.UTF-8/'[=e=]': 01e19 ch=ḙ tr==ḙ= sed===
> > en_US.UTF-8/'[=e=]': 01e1a ch=Ḛ tr==Ḛ= sed===
> > en_US.UTF-8/'[=e=]': 01e1b ch=ḛ tr==ḛ= sed===
> > en_US.UTF-8/'[=e=]': 01e1c ch=Ḝ tr==Ḝ= sed===
> > en_US.UTF-8/'[=e=]': 01e1d ch=ḝ tr==ḝ= sed===
> > en_US.UTF-8/'[=e=]': 01eb8 ch=Ẹ tr==Ẹ= sed===
> > en_US.UTF-8/'[=e=]': 01eb9 ch=ẹ tr==ẹ= sed===
> > en_US.UTF-8/'[=e=]': 01eba ch=Ẻ tr==Ẻ= sed===
> > en_US.UTF-8/'[=e=]': 01ebb ch=ẻ tr==ẻ= sed===
> > en_US.UTF-8/'[=e=]': 01ebc ch=Ẽ tr==Ẽ= sed===
> > en_US.UTF-8/'[=e=]': 01ebd ch=ẽ tr==ẽ= sed===
> > en_US.UTF-8/'[=e=]': 01ebe ch=Ế tr==Ế= sed===
> > en_US.UTF-8/'[=e=]': 01ebf ch=ế tr==ế= sed===
> > en_US.UTF-8/'[=e=]': 01ec0 ch=Ề tr==Ề= sed===
> > en_US.UTF-8/'[=e=]': 01ec1 ch=ề tr==ề= sed===
> > en_US.UTF-8/'[=e=]': 01ec2 ch=Ể tr==Ể= sed===
> > en_US.UTF-8/'[=e=]': 01ec3 ch=ể tr==ể= sed===
> > en_US.UTF-8/'[=e=]': 01ec4 ch=Ễ tr==Ễ= sed===
> > en_US.UTF-8/'[=e=]': 01ec5 ch=ễ tr==ễ= sed===
> > en_US.UTF-8/'[=e=]': 01ec6 ch=Ệ tr==Ệ= sed===
> > en_US.UTF-8/'[=e=]': 01ec7 ch=ệ tr==ệ= sed===
> > -- snip --
> >
> > AFAIK the test script should print nothing if "sed" and "tr" would
> > match exactly the same on a per-character basis...
> The message still doesn't show up in
> http://lists.research.att.com/pipermail/ast-developers/2013q2/date.html
> Does the list still work?
> Ced
> --
> Cedric Blancher <[email protected]>
> Institute Pasteur
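
For anyone who wants to poke at the regcomp side of this outside of sed, here is a minimal sketch. It is not the libast code: it goes through the system's POSIX regcomp()/regexec(), and the helper name bracket_matches() is made up for illustration. It assumes the caller has already set the locale; with no setlocale() call the program runs in the "C" locale, where [=e=] can be expected to cover at most "e" itself. A C library without equivalence class support may reject the pattern, in which case the helper returns -1.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * bracket_matches() is a hypothetical helper, not part of libast or POSIX.
 * It compiles `bracket` (for example "[[=e=]]") as an extended regular
 * expression with the system regcomp() and reports whether `text`
 * contains a match.
 *
 * Returns 1 on a match, 0 on no match, -1 if the pattern does not compile
 * (for example when the C library rejects equivalence classes).
 */
int bracket_matches(const char *bracket, const char *text)
{
    regex_t re;
    int rc;

    assert(bracket != NULL && text != NULL);

    /* REG_NOSUB: only a yes/no answer is needed, no match offsets. */
    if (regcomp(&re, bracket, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    return rc == 0 ? 1 : 0;
}
```

The interesting case from the thread is bracket_matches("[[=e=]]", "é") after setlocale(LC_ALL, "en_US.UTF-8"). The answer there depends on the locale's collation data and on the C library, which is the same portability problem described at the top of this mail.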
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
