Sorry, I should have sent my previous mail encoded as UTF-8.
The following is the same as the previous one except for the character encoding.
Hello,
tcs, both for Plan 9 and for Unix, has a bug in reading UTF text.
The bug comes from:
utf_in(int fd, long *notused, struct convert *out){
char buf[N];
...
while((n = read(fd, buf+tot, N-tot)) >= 0){
...
}
in utf.c.
N is defined as 10000 in hdr.h.
If you set N to 10, you will see the problem more clearly:
tcs does not correctly handle a UTF character that is split across the read buffer boundary.
For example, assume a.txt has the content:
aaaaaaaこの
term% xd -c a.txt
0000000 a a a a a a a e3 81 93 e3 81 ae \n
000000e
tcs can handle this text because, with N=10, the first read ends exactly on a UTF character boundary (7 bytes of 'a' plus the 3 bytes of こ),
but tcs fails if the number of 'a's is 6 or 8 ...
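
One possible way to make a read loop safe against such splits is sketched below. This is not the actual tcs code: the helper name utf8_complete_prefix is hypothetical, and the sketch assumes standard UTF-8 lead bytes (it also accepts 4-byte sequences, which the tcs of this report may not use). Given the n bytes currently in the buffer, it returns how many of them end on a complete character. The caller would convert only that prefix, move the leftover bytes to the front of the buffer, and continue reading after them.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of tcs: return the length of the longest
 * prefix of buf[0..n) that ends on a UTF-8 character boundary.
 * Malformed input is returned whole so the decoder can report it.
 */
size_t
utf8_complete_prefix(const unsigned char *buf, size_t n)
{
	size_t i, need, have;
	unsigned char c;

	/* step back over at most 3 trailing continuation bytes (10xxxxxx) */
	i = n;
	while(i > 0 && n - i < 3 && (buf[i-1] & 0xC0) == 0x80)
		i--;
	if(i == 0)
		return n;	/* empty, or no lead byte in view */

	/* buf[i-1] should be the lead byte of the last character */
	c = buf[i-1];
	if(c < 0x80)
		need = 1;
	else if((c & 0xE0) == 0xC0)
		need = 2;
	else if((c & 0xF0) == 0xE0)
		need = 3;
	else if((c & 0xF8) == 0xF0)
		need = 4;
	else
		return n;	/* not a valid lead byte */

	have = n - (i - 1);	/* bytes present for the last character */
	if(have < need)
		return i - 1;	/* last character is incomplete: cut before it */
	return n;
}
```

With N=10 and six 'a's, the buffer ends with the first byte of の, so the helper returns 9 and one byte is carried over to the next read. With eight 'a's it returns 8 and two bytes are carried over.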
tcs is very important to me.
Who maintains tcs?
I might be able to help with debugging.
Kenji Arisawa