Re: [ast-developers] [ast-users] Matching accented é with [=e=] using AST tr

Glenn Fowler Mon, 08 Apr 2013 10:50:02 -0700

the code used by sed is embedded in src/lib/libast/regex/regcomp.c
the tr code is incomplete and should use something very close to
the regcomp code


it will take a few alphas to pull out the common code and make a good api

all of this rigmarole because there is no standard way to access
the collating sequence for a given locale
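
Until tr shares the regcomp code, one possible stopgap is to let sed do the equivalence-class work, since the quoted report below shows sed handling [[=e=]] correctly. The following is only a sketch under assumptions: the helper names del_class and keep_class are hypothetical, the sed in use must itself support [=c=] in bracket expressions, and which accented characters match depends on the locale in effect.

```shell
# Hypothetical helpers: sed stands in for tr where tr mishandles [=c=].

# Delete every character matching the bracket item $1 (e.g. '[=e=]').
# Roughly what `tr -d "$1"` is expected to do.
del_class() {
    sed "s/[$1]//g"
}

# Keep only the characters matching the bracket item $1.
# Roughly `tr -Cd "$1"`, except that sed works line by line,
# so the trailing newline of each input line is preserved.
keep_class() {
    sed "s/[^$1]//g"
}

# Usage with the input from the original report:
printf '%s\n' '1e2é3' | keep_class '[=e=]'
printf '%s\n' '1e2é3' | del_class '[=e=]'
```

This is not a replacement for fixing tr itself. It only moves the matching into a tool whose bracket-expression code already goes through regcomp.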

On Sat, 6 Apr 2013 19:31:25 +0200 Cedric Blancher wrote:
> On 6 April 2013 03:45, Roland Mainz <[email protected]> wrote:
> > [Repost... seems the original email got somehow lost in a mailman
> > server outage... ;-( ]
> >
> > ---------- Forwarded message ----------
> > From: Roland Mainz <[email protected]>
> > Date: Sat, Apr 6, 2013 at 3:10 AM
> > Subject: Re: [ast-users] Matching accented é with [=e=] using AST tr
> > To: Cedric Blancher <[email protected]>, Glenn Fowler
> > <[email protected]>
> > Cc: [email protected], ast-users <[email protected]>
> >
> >
> > On Fri, Mar 15, 2013 at 3:57 PM, Cedric Blancher
> > <[email protected]> wrote:
> >> On 14 March 2013 23:01, Roland Mainz <[email protected]> wrote:
> >>> On Thu, Mar 14, 2013 at 2:19 PM, Cedric Blancher
> >>> <[email protected]> wrote:
> >>>> How do I match accented e (i.e. é) using an equivalence class in AST tr?
> >>>>
> >>>> Doing that in sed is easy:
> >>>> ~/bin/sed -r "s/[[=e=]]/X/g" <<<"8é8" ; printf "\n"
> >>>> 8X8
> >>>>
> >>>> But in tr I am not able to get it working:
> >>>> ksh -c 'builtin tr ; tr -Cd "[=e=]" <<<"1e2é3" ; print'
> >>>> e
> >>>>
> >>>> AFAIK this should print "eé".
> >>>>
> >>>> I used:
> >>>>   version         tr (AT&T Research) 2012-11-12
> >>>>   version         sed (AT&T Research) 2012-03-28
> >>>
> >>> Erm... without digging around... does AST "tr" support the POSIX
> >>> equivalence class syntax yet (Glenn... ping!) ? My first guess would
> >>> be to try another platform like Solaris to see if the issue is
> >>> libc-related...
> >>
> >> Glenn, does AST tr support the [=e=] syntax?
> > [snip]
> >
> > Technically there is code in src/lib/libcmd/tr.c to support [=e=] ...
> > -- snip --
> >                 case '.':
> >                 case '=':
> >                         if ((q = regcollate((char*)tr->next, (char**)&e, buf, sizeof(buf), &wc)) >= 0)
> >                         {
> >                                 tr->next = e;
> >                                 c = q ? buf[0] : 0;
> >                                 break;
> >                         }
> >                         /*FALLTHROUGH*/
> >                 member:
> >                         if (*(e = tr->next + 1))
> >                         {
> >                                 while (*++e && *e != c && *e != ']');
> >                                 if (*e != ']' && *++e == ']')
> >                                         return -2;
> >                         }
> > -- snip --
> > ... but it doesn't seem to work... ;-(
> >
> > The following testcase prints the differences between "tr" and "sed"
> > for a given "tr"-like pattern:
> > -- snip --
> > set -o nounset
> > IFS=''
> >
> > typeset -li16 i
> > typeset sc # plain character to test
> > typeset sq # character "sc" quoted and wrapped in '='
> > typeset s1 s2 # tests
> >
> > builtin tr
> >
> > typeset -T pat_t=(
> >         typeset lc_all
> >         typeset pattern
> > )
> >
> > integer p
> > pat_t -a patlist=(
> >         ( lc_all='en_US.UTF-8' pattern='[=e=]' )
> > )
> >
> > for (( p=0 ; p < ${#patlist[@]} ; p++ )) ; do
> >         nameref pat=patlist[p]
> >         (
> >                 export LC_ALL="${pat.lc_all}"
> >                 for (( i=0x30 ; i< 0x2000 ; i++ )) ; do
> >                         sc="$(printf "\u[${i#16#}]\n" 2>'/dev/null')"
> >
> >                         # no pipe here to avoid the costs for |fork()|
> >                         sq="$(printf "=%s=" "$sc")"
> >
> >                         s1="$(tr -d "${pat.pattern}" <<<"$sq")"
> >                         s2="$(sed "s/[${pat.pattern}]//g" <<<"$sq")"
> >                         [[ "$s1" != "$s2" ]] && printf "%q/%q: %5.5x ch=%s tr=%s sed=%s\n" \
> >                                 "${pat.lc_all}" "${pat.pattern}" i "$sc" "$s1" "$s2"
> >                 done
> >         )
> > done
> > -- snip --
> >
> > With ast-ksh.2013-04-02 the output looks like this (on SuSE 
> > 12.2/AMD64/64bit):
> > -- snip --
> > $ ~/bin/ksh /tmp/tr_test17.sh
> > en_US.UTF-8/'[=e=]': 00045 ch=E tr==E= sed===
> > en_US.UTF-8/'[=e=]': 000c8 ch=È tr==È= sed===
> > en_US.UTF-8/'[=e=]': 000c9 ch=É tr==É= sed===
> > en_US.UTF-8/'[=e=]': 000ca ch=Ê tr==Ê= sed===
> > en_US.UTF-8/'[=e=]': 000cb ch=Ë tr==Ë= sed===
> > en_US.UTF-8/'[=e=]': 000e8 ch=è tr==è= sed===
> > en_US.UTF-8/'[=e=]': 000e9 ch=é tr==é= sed===
> > en_US.UTF-8/'[=e=]': 000ea ch=ê tr==ê= sed===
> > en_US.UTF-8/'[=e=]': 000eb ch=ë tr==ë= sed===
> > en_US.UTF-8/'[=e=]': 00112 ch=Ē tr==Ē= sed===
> > en_US.UTF-8/'[=e=]': 00113 ch=ē tr==ē= sed===
> > en_US.UTF-8/'[=e=]': 00114 ch=Ĕ tr==Ĕ= sed===
> > en_US.UTF-8/'[=e=]': 00115 ch=ĕ tr==ĕ= sed===
> > en_US.UTF-8/'[=e=]': 00116 ch=Ė tr==Ė= sed===
> > en_US.UTF-8/'[=e=]': 00117 ch=ė tr==ė= sed===
> > en_US.UTF-8/'[=e=]': 00118 ch=Ę tr==Ę= sed===
> > en_US.UTF-8/'[=e=]': 00119 ch=ę tr==ę= sed===
> > en_US.UTF-8/'[=e=]': 0011a ch=Ě tr==Ě= sed===
> > en_US.UTF-8/'[=e=]': 0011b ch=ě tr==ě= sed===
> > en_US.UTF-8/'[=e=]': 0018e ch=Ǝ tr==Ǝ= sed===
> > en_US.UTF-8/'[=e=]': 0018f ch=Ə tr==Ə= sed===
> > en_US.UTF-8/'[=e=]': 00190 ch=Ɛ tr==Ɛ= sed===
> > en_US.UTF-8/'[=e=]': 001dd ch=ǝ tr==ǝ= sed===
> > en_US.UTF-8/'[=e=]': 00204 ch=Ȅ tr==Ȅ= sed===
> > en_US.UTF-8/'[=e=]': 00205 ch=ȅ tr==ȅ= sed===
> > en_US.UTF-8/'[=e=]': 00206 ch=Ȇ tr==Ȇ= sed===
> > en_US.UTF-8/'[=e=]': 00207 ch=ȇ tr==ȇ= sed===
> > en_US.UTF-8/'[=e=]': 00228 ch=Ȩ tr==Ȩ= sed===
> > en_US.UTF-8/'[=e=]': 00229 ch=ȩ tr==ȩ= sed===
> > en_US.UTF-8/'[=e=]': 00259 ch=ə tr==ə= sed===
> > en_US.UTF-8/'[=e=]': 0025b ch=ɛ tr==ɛ= sed===
> > en_US.UTF-8/'[=e=]': 01e14 ch=Ḕ tr==Ḕ= sed===
> > en_US.UTF-8/'[=e=]': 01e15 ch=ḕ tr==ḕ= sed===
> > en_US.UTF-8/'[=e=]': 01e16 ch=Ḗ tr==Ḗ= sed===
> > en_US.UTF-8/'[=e=]': 01e17 ch=ḗ tr==ḗ= sed===
> > en_US.UTF-8/'[=e=]': 01e18 ch=Ḙ tr==Ḙ= sed===
> > en_US.UTF-8/'[=e=]': 01e19 ch=ḙ tr==ḙ= sed===
> > en_US.UTF-8/'[=e=]': 01e1a ch=Ḛ tr==Ḛ= sed===
> > en_US.UTF-8/'[=e=]': 01e1b ch=ḛ tr==ḛ= sed===
> > en_US.UTF-8/'[=e=]': 01e1c ch=Ḝ tr==Ḝ= sed===
> > en_US.UTF-8/'[=e=]': 01e1d ch=ḝ tr==ḝ= sed===
> > en_US.UTF-8/'[=e=]': 01eb8 ch=Ẹ tr==Ẹ= sed===
> > en_US.UTF-8/'[=e=]': 01eb9 ch=ẹ tr==ẹ= sed===
> > en_US.UTF-8/'[=e=]': 01eba ch=Ẻ tr==Ẻ= sed===
> > en_US.UTF-8/'[=e=]': 01ebb ch=ẻ tr==ẻ= sed===
> > en_US.UTF-8/'[=e=]': 01ebc ch=Ẽ tr==Ẽ= sed===
> > en_US.UTF-8/'[=e=]': 01ebd ch=ẽ tr==ẽ= sed===
> > en_US.UTF-8/'[=e=]': 01ebe ch=Ế tr==Ế= sed===
> > en_US.UTF-8/'[=e=]': 01ebf ch=ế tr==ế= sed===
> > en_US.UTF-8/'[=e=]': 01ec0 ch=Ề tr==Ề= sed===
> > en_US.UTF-8/'[=e=]': 01ec1 ch=ề tr==ề= sed===
> > en_US.UTF-8/'[=e=]': 01ec2 ch=Ể tr==Ể= sed===
> > en_US.UTF-8/'[=e=]': 01ec3 ch=ể tr==ể= sed===
> > en_US.UTF-8/'[=e=]': 01ec4 ch=Ễ tr==Ễ= sed===
> > en_US.UTF-8/'[=e=]': 01ec5 ch=ễ tr==ễ= sed===
> > en_US.UTF-8/'[=e=]': 01ec6 ch=Ệ tr==Ệ= sed===
> > en_US.UTF-8/'[=e=]': 01ec7 ch=ệ tr==ệ= sed===
> > -- snip --
> >
> > AFAIK the test script should print nothing if "sed" and "tr"
> > matched exactly the same characters...

> The message still doesn't show up in
> http://lists.research.att.com/pipermail/ast-developers/2013q2/date.html

> Does the list still work?

> Ced
> -- 
> Cedric Blancher <[email protected]>
> Institute Pasteur

_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
