Sorry, I should have sent the previous mail in UTF-8 encoding.
The following is the same as the previous one except for the character encoding.

Hello,

tcs, both for Plan 9 and for Unix, has a bug in reading UTF text.
It comes from:
utf_in(int fd, long *notused, struct convert *out){
    char buf[N];
    ...
    while((n = read(fd, buf+tot, N-tot)) >= 0){
        ...
}

in utf.c

N is defined as 10000 in hdr.h

If you set N to 10, you will see the problem more clearly:
tcs does not handle UTF character boundaries correctly.

For example, assume a.txt has the content:
aaaaaaaこの

term% xd -c a.txt
0000000   a  a  a  a  a  a  a e3 81 93 e3 81 ae \n
000000e

tcs can handle this text because, with N=10, the first 10 bytes (seven 'a's
plus the three bytes of こ) end exactly on a UTF character boundary,
but tcs fails if there are 6 or 8 'a's ...
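
A minimal sketch of one way a reader could avoid cutting a character in half. This is not taken from the tcs source; the names utf8_seqlen and utf8_complete_prefix are hypothetical, and it assumes the input is UTF-8 with sequences of at most 4 bytes. The helper returns how many bytes at the start of the buffer can be converted safely; the bytes after that are the start of a character whose rest has not been read yet.

```c
#include <assert.h>
#include <stddef.h>

/* Total length of a UTF-8 sequence, judged from its first byte.
 * Returns 0 if c cannot start a sequence. */
static size_t
utf8_seqlen(unsigned char c)
{
    if(c < 0x80)
        return 1;
    if((c & 0xE0) == 0xC0)
        return 2;
    if((c & 0xF0) == 0xE0)
        return 3;
    if((c & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/* Length of the longest prefix of buf[0..n) that does not end in the
 * middle of a character. Hypothetical helper, for illustration only. */
size_t
utf8_complete_prefix(const unsigned char *buf, size_t n)
{
    size_t back, i;

    assert(buf != NULL || n == 0);
    /* A cut-off sequence has at most 3 bytes present, so looking at
     * the last 3 bytes is enough to find its first byte. */
    for(back = 1; back <= 3 && back <= n; back++){
        i = n - back;
        if((buf[i] & 0xC0) == 0x80)
            continue;           /* continuation byte: keep looking back */
        if(utf8_seqlen(buf[i]) > back)
            return i;           /* sequence starts at i but is cut off */
        return n;               /* last sequence is complete (or invalid) */
    }
    return n;   /* no starting byte found: leave it to the decoder */
}
```

With the a.txt example and N=10, this gives 9 for six 'a's, 10 for seven, and 8 for eight. A read loop could convert only the first k = utf8_complete_prefix(buf, n) bytes, move the remaining n-k bytes to the start of buf, and continue reading after them.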

tcs is very important to me.
Who maintains tcs?
I might be able to help with debugging.

Kenji Arisawa

