Marko Rauhamaa <ma...@pacujo.net> writes:
> Eli Zaretskii <e...@gnu.org>:
>> Btw, if by "UCS-2" you meant to say that only characters within the
>> BMP are supported in file names on Windows, then this is wrong
> No, I'm claiming Windows allows pathnames to contain isolated surrogate
> code points, which cannot be decoded back to Unicode with UTF-16.
> The situation is completely analogous to Linux pathnames that can
> contain illegal UTF-8.
>> : since Windows XP, NTFS volumes support file names with characters
>> outside of the BMP. I've just successfully created files with such
>> file names on Windows XP using Emacs.
> Both Windows and Linux filenames support all of Unicode. Trouble is,
> both of them support more than Unicode, making it impossible to use
> Guile's strings for an arbitrary filename.
> Python solves the problem by using a Unicode superset in its strings. I
> think that's misguided, and Guile is correct in sticking to Unicode.
> If I understood it correctly, someone just told us Emacs maps illegal
> UTF-8 to another form of illegal UTF-8 and back. That's better in that
> it's bytes to bytes (leaving Unicode out), but it's not immediately
> obvious to me why you have to transform the byte sequence at all.
After the transformation, the resulting string satisfies the invariants
of the UTF-8 encoding scheme: a character is either a single byte in the
range 0x00 to 0x7f, or a sequence consisting of a lead byte with n+1
high bits set followed by n continuation bytes in the range 0x80 to 0xbf.
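Concretely, the lead byte alone determines the sequence length. A small
sketch in Python (illustration only, not Emacs code):

    def utf8_seq_len(lead):
        """Total bytes in the sequence headed by `lead`, per the
        invariant above: 0x00-0x7f stand alone; a lead byte with
        n+1 high bits set is followed by n continuation bytes."""
        if lead < 0x80:
            return 1
        bits = 0
        while lead & (0x80 >> bits):    # count the set high bits
            bits += 1
        if bits == 1:                   # 10xxxxxx: continuation byte
            raise ValueError("continuation byte, not a lead byte")
        return bits                     # n+1 bits set -> n+1 bytes total

    assert utf8_seq_len(0x41) == 1      # 'A'
    assert utf8_seq_len(0xC3) == 2      # lead of a 2-byte char like 'é'
    assert utf8_seq_len(0xE2) == 3      # lead of a 3-byte char like '€'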
All string operations can then access the individual _characters_ of
such strings (rather than bytes) by working with this general coding
scheme. In addition, valid UTF-8 remains valid UTF-8, and all string
search/regexp operations work fine on it. In fact, string search/regexp
operations even work with Emacs strings representing _invalid_ UTF-8.
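The same property is easy to show with Python's escaped strings, for
lack of Emacs at hand: the escapes are just ordinary (if invalid)
characters as far as the regexp engine is concerned.

    import re

    name = b"report\xff.txt".decode("utf-8", "surrogateescape")
    assert re.search(r"\.txt$", name)   # matches despite the stray byte
    assert name.endswith(".txt")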
The purpose of the transform is to _not_ have a flavorless chunk of
bytes but rather something organized into characters. Of course this
only makes sense when _typically_ those characters correspond to valid
UTF-8. If you know your string to be UTF-16, a different reencoding
would make sense, but preferably also one preserving both valid and
invalid sequences.
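A toy version of such a transform, to make the idea concrete. This is
illustration only: Emacs's actual escape codepoints and decoder differ,
the RAW base below is made up, and since Python strings cannot hold
characters beyond U+10FFFF, the sketch uses plain integers as characters.

    RAW = 0x3FFF00    # made-up escape base above the Unicode range

    def decode_chars(data):
        """Decode UTF-8 bytes into integer characters; a byte that is
        not part of a valid sequence becomes RAW+byte instead of being
        lost, so the result is organized into characters throughout."""
        chars, i = [], 0
        while i < len(data):
            for j in (4, 3, 2, 1):      # longest well-formed wins
                chunk = data[i:i+j]
                try:
                    s = chunk.decode("utf-8")
                except UnicodeDecodeError:
                    continue
                if len(s) == 1:         # exactly one character: take it
                    chars.append(ord(s))
                    i += len(chunk)
                    break
            else:
                chars.append(RAW + data[i])   # stray byte: escape it
                i += 1
        return chars

    def encode_chars(chars):
        """Invert decode_chars, byte for byte."""
        out = bytearray()
        for c in chars:
            if c >= RAW:
                out.append(c - RAW)     # escaped byte returns verbatim
            else:
                out.extend(chr(c).encode("utf-8"))
        return bytes(out)

    raw = b"ok\xe2\x82"    # ends inside a split three-byte character
    assert encode_chars(decode_chars(raw)) == raw   # lossless round-trip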
> Look at the problem of concatenation. We could have a case where two
> illegal UTF-8 (or UTF-16) snippets are concatenated to get valid UTF-8
> (or UTF-16).
Not possible in the internal Emacs representation: its character
boundaries are established on decoding. Only after reencoding such a
concatenation may valid UTF-8 fall out. If you transfer text in blocks
of a given byte size, this can happen. But concatenating in the decoded
character domain will not magically change the character count: you'll
retain two (or more) pieces of a split UTF-8 character. If you want to
do processing on concatenated blocks, you had better decode only after
concatenation. Or reencode and redecode, but that seems wasteful. It
would work, though.
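The point is easy to demonstrate with Python's surrogateescape, where
the escaped pieces behave the same way as in the toy scheme above:

    left, right = b"\xe2\x82", b"\xac"   # '€' split across two blocks

    # Decode first, then concatenate: still three escaped bytes, no '€'.
    a = left.decode("utf-8", "surrogateescape")
    b = right.decode("utf-8", "surrogateescape")
    print(ascii(a + b))                  # '\udce2\udc82\udcac'

    # Concatenate first, then decode: the character falls out.
    print((left + right).decode("utf-8"))             # '€'

    # Reencode and redecode also recovers it: wasteful, but it works.
    rejoined = (a + b).encode("utf-8", "surrogateescape").decode("utf-8")
    print(rejoined)                                   # '€'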
> That operation fails if you try to translate the snippets to strings
> before concatenation. Such concatenation operations are commonplace
> when dealing with filenames (eg, split(1)).
split(1) does not "deal with filenames" when splitting, but the
individual files it produces may well be cut inside a UTF-8 sequence.
See above.