>I am not at all secure about how the standard GNU utilities will handle
>non-ascii characters. For example, 'wc -c', just counts bytes. True,
>the man page talks about bytes, not characters, but I am still left
>uncomfortable. Then there are the dozens of bash, python, and perl
>scripts that I have accumulated over the years.
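
To make the bytes-versus-characters point concrete, here is a minimal Python sketch. The sample string is made up for illustration and is not from any real script. It compares the character count with the UTF-8 byte count (the number 'wc -c' would report for the same text) and checks that every byte belonging to a non-ASCII character has its high bit set, so none of them can be confused with an ASCII byte.

```python
# Illustrative sample text (an assumption, not from the thread):
# "naive cafe" with i-diaeresis (U+00EF) and e-acute (U+00E9).
text = "na\u00efve caf\u00e9"

encoded = text.encode("utf-8")

char_count = len(text)       # number of code points
byte_count = len(encoded)    # number of bytes, which is what 'wc -c' counts

# Collect the UTF-8 bytes that belong to non-ASCII characters only.
non_ascii_bytes = [
    b
    for ch in text
    if ord(ch) > 127
    for b in ch.encode("utf-8")
]

# In UTF-8, every byte of a multibyte sequence is >= 0x80,
# so none of them falls in the ASCII range.
all_high = all(b >= 0x80 for b in non_ascii_bytes)

print(char_count, byte_count, all_high)  # prints: 10 12 True
```

The two counts differ by exactly the two extra bytes needed for the two accented letters, which is the gap a byte-oriented tool leaves when it is asked about characters.
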
My experience has been that a modern system handles 8-bit characters just fine. Where things get a little tricky is with multibyte encodings like UTF-8. As you noted, not everyone has broken from the paradigm that 1 byte == 1 character (we had to do a bunch of work in the format engine to fix that). But UTF-8 has the excellent property that the bytes of non-ASCII characters look like ordinary 8-bit characters and can never be mistaken for ASCII (not a surprise, since it was designed by two of the original Unix geeks), so I haven't come across a program where it truly breaks.

I don't write in Python, but Perl support for UTF-8 is excellent and I would be shocked if the situation for Python weren't the same. I jumped whole-hog into UTF-8 a few years ago, and I haven't regretted it one bit.

--Ken

_______________________________________________
Nmh-workers mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/nmh-workers
