Hi, I'm probably going to get a lot of heat for this, but... At the current point in time there is sort of a discussion raging on (or by now raging off) regarding string and transfer encoding strategies, and it is partly based on differing philosophies between Emacs and GUILE.
Now GUILE internally employs libunistring, so its design decisions obviously strongly favor the possibilities available in libunistring. Looking at the libunistring documentation, I find that the available options do not actually make Emacs-like behavior feasible in an efficient manner.

The essence of the situation boils down to the most common encodings of Emacs being round-trippable: if I load a file declared to be "utf-8", I can save it as "utf-8", and even if it came straight from /dev/random, it will reproduce the original byte sequence regardless of whether it constitutes valid utf-8 or not. The way this is done in multibyte strings is that any stray byte not part of a valid minimal utf-8 sequence (such bytes can only take the values 128-255, since ASCII characters are always valid) is encoded as a 2-byte overlong representation of the character codes 0-127, namely 0xc0 0x80 to 0xc1 0xbf. So the growth factor of a file is limited. When interpreted as character codes, these patterns representing single bytes lie outside of the range covered by Unicode (starting at 0x3fff80, actually).

Emacs does actually support character codes in that extended range: 0x3fff00 is still encodable and is represented by the byte pattern 0xf8 0x8f 0xbf 0xbc 0x80, a 5-byte sequence in the basic UTF-8 encoding scheme. I think the Emacs character set ends with those last 128 characters encoded as 2-byte sequences. Emacs uses the extended character ranges beyond Unicode to represent various Asian character sets that are, according to users of those character sets, not adequately represented in Unicode. But that extended character range is something quite particular to Emacs.

What I am actually more interested in is having libunistring offer "roundtrippable" encodings as a fallback for decoding errors. Basically, I want an option for decoding where libunistring announces "what you have here is not valid utf-8, but I know how to deal with it". Including reencoding.
And delivering unique "character codes" and sensible string length calculations. The application would either keep track of having received "dirty utf-8" and reencode when putting out utf-8 (where reencoding "internal utf-8" to "external utf-8" means replacing the 2-byte sequences representing a wild byte by their original byte), or it would reencode into "external" utf-8 when writing anyway, which would not change anything for originally valid utf-8.

The basic point would be the ability to process any input assuming a specified locale, with graceful degradation where the locale assumption is violated: for example, a regular-expression replacement of text in a mixed text/binary file (like PostScript often is) without affecting the binary passages. Not requiring a latin-1 interpretation of the input in order to use the internal UTF-8-based string processing for lossless input handling from a file, terminal, network connection, or other source provides additional flexibility for an application using libunistring.

The support would basically come in 3 parts:

a) decoding and encoding strategies that allow an "escape code" representation of raw bytes not fitting into regular UTF-8;

b) a unique character code returned when converting such a sequence into a character code;

c) guarantees about the processing of those sequences -- most likely already met, since they fit into the normal patterns of UTF-8 encoding reasonably well. Character ranges in regular expressions, upper- and lowercasing, and some other operations strongly related to character code points would likely require checking and possibly changes.

Now I cannot vouch for the actual interest of GUILE developers in roundtripping coding systems and/or conversions. I suspect this is also a chicken-and-egg situation where availability in libunistring would change the perception of desirability.
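To make parts a) and the reencoding step concrete, here is a rough C sketch -- entirely my own illustration, not anything libunistring currently offers, and all function names are made up. A stray byte (necessarily 0x80..0xFF) is escaped as the overlong two-byte pattern 0xc0 0x80 .. 0xc1 0xbf; reencoding to "external" utf-8 simply replaces each such pattern by its original byte, which is the identity for input that was valid utf-8 to begin with:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Length of a valid minimal UTF-8 sequence at p (n bytes available),
   or 0 if there is none.  */
static size_t
utf8_sequence_length (const uint8_t *p, size_t n)
{
  if (n == 0)
    return 0;
  uint8_t b = p[0];
  size_t len;
  uint32_t c;
  if (b < 0x80)
    return 1;                   /* ASCII is always valid.  */
  else if (b >= 0xC2 && b <= 0xDF)
    len = 2, c = b & 0x1F;
  else if (b >= 0xE0 && b <= 0xEF)
    len = 3, c = b & 0x0F;
  else if (b >= 0xF0 && b <= 0xF4)
    len = 4, c = b & 0x07;
  else
    return 0;                   /* 0x80..0xC1, 0xF5..0xFF: never a lead.  */
  if (n < len)
    return 0;
  for (size_t i = 1; i < len; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (p[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and codes beyond Unicode.  */
  if ((len == 3 && c < 0x800) || (len == 4 && c < 0x10000)
      || (c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
    return 0;
  return len;
}

/* Escape dirty input into "internal" utf-8.  out needs room for up to
   2 * n bytes, so the growth factor is bounded.  Returns the number of
   bytes written.  */
size_t
utf8_escape_raw_bytes (const uint8_t *in, size_t n, uint8_t *out)
{
  size_t o = 0;
  while (n > 0)
    {
      size_t len = utf8_sequence_length (in, n);
      if (len > 0)
        {
          memcpy (out + o, in, len);
          o += len, in += len, n -= len;
        }
      else
        {
          /* Raw byte 0x80..0xFF becomes 0xC0 0x80 .. 0xC1 0xBF.  */
          out[o++] = 0xC0 | ((*in >> 6) & 1);
          out[o++] = 0x80 | (*in & 0x3F);
          in++, n--;
        }
    }
  return o;
}

/* Reencode "internal" utf-8 to "external" utf-8 by restoring raw
   bytes.  The escape patterns never occur in valid minimal utf-8, so
   this is unambiguous.  out needs room for n bytes.  */
size_t
utf8_unescape_raw_bytes (const uint8_t *in, size_t n, uint8_t *out)
{
  size_t o = 0, i = 0;
  while (i < n)
    {
      if (i + 1 < n && (in[i] & 0xFE) == 0xC0 && (in[i + 1] & 0xC0) == 0x80)
        {
          /* Escape pattern 0xC0/0xC1 + continuation: recover the byte.  */
          out[o++] = 0x80 | ((in[i] & 1) << 6) | (in[i + 1] & 0x3F);
          i += 2;
        }
      else
        out[o++] = in[i++];
    }
  return o;
}
```

For part b), the unique character code for an escaped byte B could then follow the Emacs convention of something like 0x3fff00 + B, safely outside the Unicode range.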
Independent of the potential use in GUILE, I would think that this sort of functionality is desirable whenever one is integrating libunistring not into a basic text processing application but rather into a programming platform. In that case, an internal representation that can accurately reflect arbitrary input, even when basically interpreted as utf-8, seems like a definite advantage to me. Thoughts?

-- David Kastrup
