> I think this has come up before, but I didn't find a reply.
> If I do something like this in awk:
>
> 	split($0, c, "");
>
> c should be an array of Runes internally, UTF externally, but apparently
> it is not. Is it just broken? Is there a replacement? Is it just the
> builtins, or is the whole awk broken?
i think the comments about this problem are missing the point a bit.
utf8 should be transparent to awk unless the situation demands that
awk needs to know the length of a character. it's not necessary
to keep strings as Rune*s internally to work with utf8. splitting
on "" is a special case where awk does need to know the length of a
character.

e.g. this script should work fine

	; cat /tmp/smile
	#!/bin/awk -f
	{
		n = split($0, c, "☺");
		for(i = 1; i <= n; i++)
			print c[i]
	}
	; echo fu☺bar|/tmp/smile
	fu
	bar

but splitting on "" won't.

i attached a patch that fixes this problem as an illustration.
i'm not using utflen because pcc won't see it. it's an ugly patch.
i don't think i know what a proper fix for awk would be. i wouldn't
think there are many cases like this, but i haven't spent much time
with awk internals.

- erik

------ 9diff run.c
/n/sources/plan9//sys/src/cmd/awk/run.c:1191,1196 - run.c:1191,1219
  	return(False);
  }
  
+ static int
+ utf8len(char *s)
+ {
+ 	int c, n, i;
+ 
+ 	c = *(unsigned char*)s++;
+ 	if ((c&0xe0) == 0xc0)
+ 		n = 2;
+ 	else if ((c&0xf0) == 0xe0)
+ 		n = 3;
+ 	else if ((c&0xf8) == 0xf0)
+ 		n = 4;
+ 	else
+ 		return 1; //-1;
+ 	i = n-1;
+ 	if(strlen(s) < i)
+ 		return 1; // -1;
+ 	for(; i-- && (c = *(unsigned char*)s++);)
+ 		if(0x80 != (c&0xc0))
+ 			return 1; //-1;
+ 	return n;
+ }
+ 
  Cell *split(Node **a, int nnn)	/* split(a[0], a[1], a[2]); a[3] is type */
  {
  	Cell *x = 0, *y, *ap;
/n/sources/plan9//sys/src/cmd/awk/run.c:1279,1290 - run.c:1302,1316
  			s++;
  		}
  	} else if (sep == 0) {	/* new: split(s, a, "") => 1 char/elem */
- 		for (n = 0; *s != 0; s++) {
- 			char buf[2];
+ 		int i, len;
+ 		char buf[5];
+ 		for (n = 0; *s != 0; s += len) {
  			n++;
  			sprintf(num, "%d", n);
- 			buf[0] = *s;
- 			buf[1] = 0;
+ 			len = utf8len(s);
+ 			for(i = 0; i < len; i++)
+ 				buf[i] = s[i];
+ 			buf[len] = 0;
  			if (isdigit(buf[0]))
  				setsymtab(num, buf, atof(buf), STR|NUM, (Array *) ap->sval);
  			else