Awk is one of the few programs in the distribution that is maintained
externally (by Brian Kernighan) and is pulled in via ape and pcc (it might
actually be the only one; I didn't bother to check). A quick glimpse at
lex.c suggests that awk scans its input one char at a time, so a multibyte
UTF-8 sequence is seen as several separate chars. In hindsight I'm a bit
surprised that I haven't been bitten by this, but I probably never split
within multibyte sequences. It's probably not too hard to change awk to read
runes, at the price of creating ``the other one true awk.''
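
To make the difference concrete, here is a rough sketch of byte-at-a-time
versus rune-at-a-time splitting. It is written in Python rather than awk so
that it does not depend on which awk is installed. The function names
(split_bytes, split_runes, pairs) are made up for this illustration and are
not taken from awk's source; it only models the behaviour that the quoted
output below is consistent with.

```python
def split_bytes(line):
    """Split the UTF-8 encoding of line into single bytes,
    the way a char-at-a-time scanner would see it."""
    data = line.encode("utf-8")
    return [data[i:i + 1] for i in range(len(data))]


def split_runes(line):
    """Split line into whole UTF-8 sequences, one per rune."""
    return [ch.encode("utf-8") for ch in line]


def pairs(units):
    """Count adjacent pairs of units, like the freqpair script does."""
    counts = {}
    for a, b in zip(units, units[1:]):
        pair = a + b
        counts[pair] = counts.get(pair, 0) + 1
    return counts


if __name__ == "__main__":
    for split in (split_bytes, split_runes):
        print(split.__name__)
        for pair, n in pairs(split("aícd")).items():
            # Half a multibyte sequence is not valid UTF-8 on its own,
            # so decode with replacement to make the damage visible.
            print(n, pair.decode("utf-8", errors="replace"))
```

With the byte-wise split, the two bytes of the accented letter land in
different array elements, so two of the pairs contain half a sequence each and
one pair is the accented letter alone. The rune-wise split gives the three
pairs one would expect.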

        Martin

* Gorka Guardiola ([EMAIL PROTECTED]) wrote:
> I think this has come up before, but I didn't find a reply.
> If I do in awk something like:
> 
> split($0, c, "");
> 
> c should be an array of Runes internally, UTF externally, but apparently
> it is not. Is it just broken? Is there a replacement? Is it just the
> builtins, or is the whole awk broken?
> 
> Example, freqpair
> 
> ------
> #!/bin/awk -f
> 
> {
>       n = split($0, c , "");
>       for(i=1; i<n; i++){
>               pair=c[i] c[i+1]
>               f[pair]++;
>       }
> }
> END{
>       for(h in f)
>               printf("%d %s\n", f[h], h);
> }
> 
> ------
> 
> % echo abcd|freqpair
> 1 ab
> 1 cd
> 1 bc
> % echo aícd|freqpair
> 1 cd
> 1 �c
> 1 í
> 1 a�
> 
> 
> where the ? is a Peter face...
> 
> Thanks.
> 
> -- 
> - curiosity sKilled the cat
