> I think this has come up before, but I didn't find a reply.
> If I do in awk something like:
> 
> split($0, c, "");
> 
> c should be an array of Runes internally, UTF externally, but apparently
> it is not. Is it just broken? Is there a replacement? Is it just the
> builtins, or is the whole awk broken?

i think the comments about this problem are missing the point
a bit.  utf8 should be transparent to awk unless the situation demands
that awk needs to know the length of a character.  it's not necessary
to keep strings as Rune*s internally to work with utf8.  splitting on
"" is a special case where awk does need to know the length of
a character.  e.g. this script should work fine

        ; cat /tmp/smile
        #!/bin/awk -f
        {
                n = split($0, c, "☺");
                for(i = 1; i <= n; i++)
                        print c[i]
        }
        ; echo fu☺bar|/tmp/smile
        fu
        bar

but splitting on "" won't.  i attached a patch that fixes this problem
as an illustration.  i'm not using utflen because pcc won't see it.
it's an ugly patch.
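
to illustrate why a multibyte separator needs no Rune conversion, here is a
minimal standalone sketch in c.  it is not code from awk's run.c; split_on
and its fixed 64-byte pieces are hypothetical, picked only to keep the
example short.  it compares bytes and never decodes a character, which is
safe because the utf8 encoding of one character never matches part of
another character's encoding.

```c
#include <stdio.h>
#include <string.h>

/*
 * split_on: hypothetical helper for illustration, not part of awk.
 * cuts s at each occurrence of the non-empty separator sep by comparing
 * bytes only.  stores up to max pieces, each truncated to 63 bytes,
 * in out and returns the number of pieces stored.
 */
int
split_on(const char *s, const char *sep, char out[][64], int max)
{
	int n = 0;
	size_t len, seplen = strlen(sep);
	const char *p;

	if (seplen == 0)
		return 0;	/* the "" case needs character lengths; not handled here */
	while (n < max) {
		p = strstr(s, sep);	/* plain byte search, no decoding */
		len = p ? (size_t)(p - s) : strlen(s);
		if (len > 63)
			len = 63;
		memcpy(out[n], s, len);
		out[n][len] = '\0';
		n++;
		if (p == NULL)
			break;
		s = p + seplen;	/* skip every byte of the separator */
	}
	return n;
}
```

called as split_on("fu☺bar", "☺", out, 4) with char out[4][64], this should
store "fu" and "bar", matching the awk script above.  the empty separator
is the one case where a byte search has nothing to find, so the code has to
know how many bytes make up each character.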

i don't think i know what a proper fix for awk would be.  i wouldn't
think there are many cases like this, but i haven't spent much time
with awk internals.

- erik

------

9diff run.c
/n/sources/plan9//sys/src/cmd/awk/run.c:1191,1196 - run.c:1191,1219
        return(False);
  }
  
+ static int
+ utf8len(char *s)
+ {
+       int c, n, i;
+ 
+       c = *(unsigned char*)s++;
+       if ((c&0xe0) == 0xc0)
+               n = 2;
+       else if ((c&0xf0) == 0xe0)
+               n = 3;
+       else if ((c&0xf8) == 0xf0)
+               n = 4;
+       else
+               return 1;       //-1;
+       i = n-1;
+       if(strlen(s) < i)
+               return 1;               // -1;
+       for(; i-- && (c = *(unsigned char*)s++);)
+               if(0x80 != (c&0xc0))
+                       return 1;       //-1;
+       return n;
+ }
+ 
  Cell *split(Node **a, int nnn)        /* split(a[0], a[1], a[2]); a[3] is type */
  {
        Cell *x = 0, *y, *ap;
/n/sources/plan9//sys/src/cmd/awk/run.c:1279,1290 - run.c:1302,1316
                                s++;
                }
        } else if (sep == 0) {  /* new: split(s, a, "") => 1 char/elem */
-               for (n = 0; *s != 0; s++) {
-                       char buf[2];
+               int i, len;
+               char buf[5];
+               for (n = 0; *s != 0; s += len) {
                        n++;
                        sprintf(num, "%d", n);
-                       buf[0] = *s;
-                       buf[1] = 0;
+                       len = utf8len(s);
+                       for(i = 0; i < len; i++)
+                               buf[i] = s[i];
+                       buf[len] = 0;
                        if (isdigit(buf[0]))
                                setsymtab(num, buf, atof(buf), STR|NUM, (Array *) ap->sval);
                        else
