Eli Zaretskii <[email protected]>:

>> From: Marko Rauhamaa <[email protected]>
>>
>> UTF-8 beautifully bridges the interpretation gap between 8-bit character
>> strings and text. However, the interpretation step should be done in the
>> application and not in the programming language.
>
> You can't do that in an environment that specifically targets
> sophisticated multi-lingual text processing independent of the outside
> locale. Unless you can interpret byte sequences as characters, you
> will be unable to even count characters in a range of text,

If you need to operate on Unicode text, have the application invoke the
UTF-8 (or locale-specific) decoder. However, have the application
request it instead of guessing that the environment is all Unicode.

> You do need "other typesetting effects", naturally, but that doesn't
> mean you can get away without more or less full support of Unicode
> nowadays.

Do support it, fully even, but let the application invoke the conversion
when appropriate.

> You are talking about programming, but we should instead think about
> applications -- those of them which need to process text, or even
> access files, as this discussion shows, do need decent Unicode
> support.

Why should opening a file require Unicode support if the underlying
operating system knows nothing about Unicode? I can open any given file
in a tiny C program without any Unicode support (under Linux, that is).

> E.g., users generally expect that decomposed and composed character
> sequences behave and are treated identically, although they are
> different byte-stream wise.

Linux begs to differ. Regardless of the locale, two different octet
sequences that ought to be equivalent UTF-8-wise will be considered
different pathnames under Linux. I don't need a helicopter to walk
across the street.

>> But is also causing unnecessary grief in the computer-computer
>> interface, where the classic textual naming and textual protocols
>> are actually cutely chosen octet-aligned binary formats.
>
> The universal acceptance of UTF-8 nowadays makes this much less of an
> issue, IME.

You are jumping the gun. Linux won't be there for a long time, if ever.
Nothing prevents a pathname, a command-line argument, an environment
variable, or the standard input from containing illegal UTF-8. I also
wouldn't like my SMTP server to throw a UTF-8 decoding exception on
parsing a command.

(Also note that even Windows allows pathnames with illegal Unicode in
them, if I'm not mistaken.)


Marko
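
A minimal sketch of the two points above, written in Python 3 purely for
illustration and touching no real files. It shows that the composed and
decomposed spellings of the same name are different octet strings, and
that an application can keep names as raw octets and decode only on
request. The names raw_name and decode_on_request are hypothetical,
made up for this example, and the sample file names are invented.

```python
import unicodedata

# Two spellings of the same visible name: precomposed (NFC) and decomposed (NFD).
composed = unicodedata.normalize("NFC", "\u00e9.txt")
decomposed = unicodedata.normalize("NFD", "\u00e9.txt")

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# Compared as octet strings, the two names are simply different.
names_differ = composed_bytes != decomposed_bytes

# Only an explicit normalization step, requested by the application,
# makes them compare equal.
same_after_normalizing = unicodedata.normalize("NFC", decomposed) == composed

# Hypothetical name: fine as a sequence of octets, but not valid UTF-8.
raw_name = b"report-\xff.txt"


def decode_on_request(octets):
    """Decode to text only when asked; return None if it is not valid UTF-8."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Python's "surrogateescape" error handler is one way to carry such octets
# through a text API and get the identical bytes back afterwards.
as_text = raw_name.decode("utf-8", "surrogateescape")
roundtrip = as_text.encode("utf-8", "surrogateescape")
```

The design choice being illustrated is that nothing fails until someone
asks for text: the octets stay usable as a name, and the decoding error
surfaces only in the one place where the application requested it.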
