Awk is one of the few programs in the distribution that is maintained externally (by Brian Kernighan) and is pulled in via ape and pcc (it might actually be the only one - I didn't bother to check). A quick glimpse at lex.c suggests that awk scans its input one char at a time, so a multibyte UTF-8 sequence is never treated as a single character. In hindsight I'm a bit surprised that I haven't been bitten by this, but I probably never split within multibyte sequences. It's probably not too hard to change awk to read runes, for the price of creating ``the other one true awk.''
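To make the difference concrete, here is a rough sketch in portable C of splitting a UTF-8 string into characters rather than bytes. The names seqlen and splitrunes are made up for illustration; nothing here is taken from awk's lex.c, and on Plan 9 one would more likely reach for chartorune() from libc than hand-roll the decoding.

```c
#include <string.h>

/*
 * Length in bytes of the UTF-8 sequence introduced by lead byte c.
 * ASCII gives 1; an invalid lead byte also gives 1, so malformed
 * input still advances one byte at a time.
 */
int
seqlen(unsigned char c)
{
	if (c < 0x80)
		return 1;
	if ((c & 0xE0) == 0xC0)
		return 2;
	if ((c & 0xF0) == 0xE0)
		return 3;
	if ((c & 0xF8) == 0xF0)
		return 4;
	return 1;
}

/*
 * Split s into characters, copying each one, NUL-terminated, into
 * out[i]. Returns the number of characters stored (at most max).
 * This is roughly what split($0, c, "") would have to do to hand
 * back whole characters instead of single bytes.
 */
int
splitrunes(const char *s, char out[][5], int max)
{
	int n, len, i;

	n = 0;
	while (*s != '\0' && n < max) {
		len = seqlen((unsigned char)*s);
		/* stop early if a continuation byte is missing */
		for (i = 1; i < len; i++)
			if (((unsigned char)s[i] & 0xC0) != 0x80)
				break;
		len = i;
		memcpy(out[n], s, len);
		out[n][len] = '\0';
		n++;
		s += len;
	}
	return n;
}
```

Fed the UTF-8 bytes of "aícd", splitrunes stores four entries and keeps the two bytes of í together, whereas a byte-at-a-time split yields five pieces and tears the í in half, which is what the freqpair output quoted below shows.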
Martin

* Gorka Guardiola ([EMAIL PROTECTED]) wrote:
> I think this has come up before, but I didn't found reply.
> If I do in awk something like:
>
> split($0, c, "");
>
> c should be an array of Runes internally, UTF externally, but apparently,
> it is not. Is it just broken?, is there a replacement?, is it just the
> builtins or is the whole awk broken?.
>
> Example, freqpair
>
> ------
> #!/bin/awk -f
>
> {
> 	n = split($0, c , "");
> 	for(i=1; i<n; i++){
> 		pair=c[i] c[i+1]
> 		f[pair]++;
> 	}
> }
> END{
> 	for(h in f)
> 		printf("%d %s\n", f[h], h);
> }
>
> ------
>
> % echo abcd|freqpair
> 1 ab
> 1 cd
> 1 bc
> % echo aícd|freqpair
> 1 cd
> 1 �c
> 1 í
> 1 a�
>
>
> where the ? is a Peter face...
>
> Thanks.
>
> --
> - curiosity sKilled the cat