> From: Rob Browning <r...@defaultvalue.org>
> Date: Sat, 06 Jul 2024 15:32:17 -0500
>
> * Problem
>
> System data like environment variables, user names, group names, file
> paths and extended attributes (xattr), etc. are on some systems (like
> Linux) binary data, and may not be encodable as a string in the current
> locale. For Linux, as an example, only the null character is an invalid
> user/group/filename byte, while for UTF-8, a much smaller set of bytes
> are valid[1].
>
> As an example, "µ" (Greek Mu) when encoded as Latin-1 is 0xb5, which is
> a completely invalid UTF-8 byte, but a perfectly legitimate Linux file
> name. As a result, (readdir dir) will return a corrupted value when the
> locale is set to UTF-8.
>
> You can try it yourself from bash if your current locale uses an
> LC_CTYPE that's incompatible with 0xb5:
>
>   $ locale | grep LC_CTYPE
>   LC_CTYPE="en_US.utf8"
>   $ guile -c '(write (program-arguments)) (newline)' $'\xb5'
>   ("guile" "?")
>
> You end up with a question mark instead of the correct value. This
> makes it difficult to write programs that don't risk silent corruption
> unless all the relevant system data is known to be compatible with the
> user's current locale.
>
> It's perhaps worth noting, that while typically unlikely, any given
> directory could contain paths in an arbitrary collection of encodings:
> UTF-8, SHIFT-JIS, Latin-1, etc., and so if you really want to try to
> handle them as strings (maybe you want to correctly upcase/downcase
> them), you have to know (somehow) the encoding that applies to each one.
> Otherwise, in the limiting case, you can only assume "bytes".

Why not learn from GNU Emacs, which already solved this very hard
problem, and has many years of user and programming experience to
prove it, instead of inventing Guile's own solution?

Here's what we have learned in Emacs since 1997 (when Emacs 20.1 was
released, the first version that tried to provide an environment
supporting multiple languages and encodings at the same time):

 . Locales are not a good mechanism for this.  A locale supports a
   single language/encoding, and switching the locale each time you
   need a different one is costly, makes many simple operations
   cumbersome, and makes the code hard to read.

 . It follows that relying on libc functions that process non-ASCII
   characters is also not the best idea: those functions depend on
   the locale, and thus force the programmer to use locales and
   switch them as needed.

 . Byte sequences that cannot be decoded for some reason are a fact
   of life, and any real-life programming system must be able to deal
   with them in a reasonable and efficient way.

 . Therefore, Emacs has arrived at the following system, which we
   have used for the last 15 years without any significant changes:

   - When text is read from an external source, it is _decoded_ into
     the internal representation of characters.  When text is written
     to an external destination, it is _encoded_ using an appropriate
     codeset.

   - The internal representation is a superset of UTF-8, in that it
     is capable of representing characters for which there are no
     Unicode codepoints: characters of charsets such as GB 18030,
     some of which don't have Unicode counterparts, and raw bytes,
     used to represent byte sequences that cannot be decoded.  It
     uses 5-byte UTF-8-like sequences for these extensions.

   - The codesets used to decode and encode can be selected by simple
     settings, and have defaults which are locale- and
     language-aware.  When the encoding of external text is not
     known, Emacs uses a series of guesses, driven by the locale, the
     nature of the source (e.g., file name), user preferences, etc.
     Encoding generally reuses the same codeset used to decode (which
     is recorded with the text), and the Lisp program can override
     that.

   - Separate global variables and corresponding functions are
     provided for decoding/encoding stuff that comes from several
     important sources and goes to the corresponding destinations.
     Examples include en/decoding of file names, en/decoding of text
     from files, en/decoding values of environment variables and
     system messages (e.g., messages from strerror), and en/decoding
     of text from subordinate processes.  Each of these gets its
     default value based on the locale and the language detected at
     startup, but a Lisp program can modify each one of them, either
     temporarily or globally.  There are also facilities for adapting
     these to specific requirements of particular external sources
     and destinations: for example, one can define special codesets
     for encoding and decoding text from/to specific programs run by
     Emacs, based on the program names.  (E.g., Git generally wants
     UTF-8 encoding regardless of the locale.)  Similarly, some
     specific file names are known to use certain encodings.  All of
     these are used to determine the proper codeset when the caller
     didn't specify one.

   - Emacs has its own code for code-conversion, for moving by
     characters through multibyte sequences, for producing a Unicode
     codepoint from a byte sequence in the super-UTF-8 representation
     and back, etc., so it doesn't use libc routines for that, and
     thus doesn't depend on the current locale for these operations.

   - APIs are provided for "manual" encoding and decoding.  A Lisp
     program can read a byte stream, then decode it "manually" using
     a particular codeset, as deemed appropriate (see the sketch
     after this list).  This allows handling complex situations where
     a program receives stuff whose encoding can only be determined
     by examining the raw byte stream (a typical example is a
     multipart email message with a MIME charset header for each
     part).

   - Emacs also has tables of Unicode attributes of characters
     (produced by parsing the relevant Unicode data files at build
     time), so it can up/down-case characters, determine their
     category (letters, digits, punctuation, etc.) and the script to
     which they belong, etc. -- all with its own code, independent of
     the underlying libc.
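
To make this concrete, here is a minimal sketch in Emacs Lisp.  The
primitives used (decode-coding-string, encode-coding-string,
char-charset) are the real Emacs APIs; the byte values are just
illustrations, and the exact printed results may differ slightly
between Emacs versions:

  ;; Decoding never loses data: a byte that is not valid UTF-8 is
  ;; kept as a raw byte in the internal representation, not replaced
  ;; by a substitute character.
  (decode-coding-string "\265" 'utf-8)
  ;; => "\265", a one-character string holding a raw byte
  (char-charset (aref (decode-coding-string "\265" 'utf-8) 0))
  ;; => eight-bit

  ;; Encoding a raw byte reproduces the original byte, so the
  ;; decode/encode round trip is lossless:
  (encode-coding-string (decode-coding-string "\265" 'utf-8) 'utf-8)
  ;; => "\265"

  ;; "Manual" decoding: keep the data as bytes until the codeset is
  ;; known (say, from a MIME charset header), then decode explicitly.
  ;; Here, the Shift-JIS bytes for the word "Japanese":
  (decode-coding-string "\223\372\226\173\214\352" 'shift_jis)
  ;; => "日本語"

The per-source defaults described above (file-name-coding-system,
locale-coding-system, process-coding-system-alist, and friends) are
what supply the codeset when the caller doesn't pass one explicitly.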

This is no doubt a complex system that needs a lot of code.  But it
does work, and works well, as proven by years of experience.  Nowadays
at least some of the functionality can be found in free libraries,
which Guile could perhaps use instead of rolling its own
implementations.  And the code used by Emacs is, of course, freely
available for study and reuse.

> At a minimum, I suggest Guile should produce an error by default
> (instead of generating incorrect data) when the system bytes cannot be
> encoded in the current locale.

In our experience, this is a mistake.  Signaling an error for each
decoding problem produces unreliable applications that punt in too
many cases.  Emacs leaves the problematic bytes alone, as raw bytes
(which are representable in the internal representation, see above),
and leaves it to higher-level application code or to the user to deal
with the results.  The "generation of incorrect data" alternative is
thus avoided, because Emacs does not replace undecodable bytes with
something else.

> As an incremental step, and as has been discussed elsewhere a bit, we
> might add support for uselocale()[2] and then document that the current
> recommendation is to always use ISO-8859-1 (i.e. Latin-1)[3] for system
> data unless you're certain your program doesn't need to be general
> purpose (perhaps you're sure you only care about UTF-8 systems).

A Latin-1 locale comes with its own baggage of rules: up- and
down-casing, character classification (letters vs. punctuation,
etc.), and other stuff.  Representing raw bytes by pretending they are
Latin-1 characters is therefore problematic, and will lead to
programming errors whereby a program cannot distinguish between a raw
byte and a Latin-1 character that happen to have the same 8-bit value
(see the postscript below for a concrete illustration).

Feel free to ask any questions about the details.

HTH
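
P.S.  A minimal illustration of that last point, again in Emacs Lisp
with real primitives and illustrative byte values: the byte 0xb5
decoded as Latin-1 is the character µ, while the same byte left
undecodable under UTF-8 stays a raw byte, and the two remain
distinguishable:

  (aref (decode-coding-string "\265" 'latin-1) 0)
  ;; => 181, the Unicode codepoint of µ
  (aref (decode-coding-string "\265" 'utf-8) 0)
  ;; => 4194229, a raw-byte "character" outside the Unicode range

If raw bytes were instead represented as Latin-1 characters, both
expressions would yield 181, and the distinction would be gone.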