Marko Rauhamaa <ma...@pacujo.net> writes:
> Eli Zaretskii <e...@gnu.org>:
>> Btw, if by "UCS-2" you meant to say that only characters within the
>> BMP are supported in file names on Windows, then this is wrong
> No, I'm claiming Windows allows pathnames to contain isolated surrogate
> code points, which cannot be decoded back to Unicode with UTF-16.
> The situation is completely analogous to Linux pathnames that can
> contain illegal UTF-8.
>> : since Windows XP, NTFS volumes support file names with characters
>> outside of the BMP. I've just successfully created files with such
>> file names on Windows XP using Emacs.
> Both Windows and Linux filenames support all of Unicode. Trouble is,
> both of them support more than Unicode, making it impossible to use
> Guile's strings for an arbitrary filename.
> Python solves the problem by using a Unicode superset in its strings. I
> think that's misguided, and Guile is correct in sticking to Unicode.
> If I understood it correctly, someone just told us Emacs maps illegal
> UTF-8 to another form of illegal UTF-8 and back. That's better in that
> it's bytes to bytes (leaving Unicode out), but it's not immediately
> obvious to me why you have to transform the byte sequence at all.
After the transformation, the resulting string satisfies the invariants
of the UTF-8 encoding scheme: a character is either a single byte in the
range 0x00 to 0x7f, or a sequence consisting of a lead byte with n+1
high bits set followed by n continuation bytes in the range 0x80 to 0xbf.
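Concretely, the lead byte alone determines the sequence length. A small
sketch in Python (illustration only, not Emacs code):

    def utf8_seq_len(lead):
        """Total bytes in the sequence headed by `lead`, per the
        invariant above: 0x00-0x7f stand alone; a lead byte with
        n+1 high bits set is followed by n continuation bytes."""
        if lead < 0x80:
            return 1
        bits = 0
        while lead & (0x80 >> bits):    # count the set high bits
            bits += 1
        if bits == 1:                   # 10xxxxxx: continuation byte
            raise ValueError("continuation byte, not a lead byte")
        return bits                     # n+1 bits set -> n+1 bytes total

    assert utf8_seq_len(0x41) == 1      # 'A'
    assert utf8_seq_len(0xC3) == 2      # lead of a 2-byte char like 'é'
    assert utf8_seq_len(0xE2) == 3      # lead of a 3-byte char like '€'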
All string operations can then access the individual _characters_ of
such strings (rather than bytes) by working with this general coding
scheme. In addition, valid UTF-8 remains valid UTF-8, and all string
search/regexp operations work fine on it. In fact, string search/regexp
operations even work with Emacs strings representing _invalid_ UTF-8.
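The same property is easy to show with Python's escaped strings, for
lack of Emacs at hand: the escapes are just ordinary (if invalid)
characters as far as the regexp engine is concerned.

    import re

    name = b"report\xff.txt".decode("utf-8", "surrogateescape")
    assert re.search(r"\.txt$", name)   # matches despite the stray byte
    assert name.endswith(".txt")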
The purpose of the transform is to _not_ have a flavorless chunk of
bytes but rather something organized into characters. Of course this
only makes sense when _typically_ those characters correspond to valid
UTF-8. If you know your string to be UTF-16, a different reencoding
would make sense, but preferably also one preserving both valid and
invalid sequences.
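A toy version of such a transform, to make the idea concrete. This is
illustration only: Emacs's actual escape codepoints and decoder differ,
the RAW base below is made up, and since Python strings cannot hold
characters beyond U+10FFFF, the sketch uses plain integers as characters.

    RAW = 0x3FFF00    # made-up escape base above the Unicode range

    def decode_chars(data):
        """Decode UTF-8 bytes into integer characters; a byte that is
        not part of a valid sequence becomes RAW+byte instead of being
        lost, so the result is organized into characters throughout."""
        chars, i = [], 0
        while i < len(data):
            for j in (4, 3, 2, 1):      # longest well-formed wins
                chunk = data[i:i+j]
                try:
                    s = chunk.decode("utf-8")
                except UnicodeDecodeError:
                    continue
                if len(s) == 1:         # exactly one character: take it
                    chars.append(ord(s))
                    i += len(chunk)
                    break
            else:
                chars.append(RAW + data[i])   # stray byte: escape it
                i += 1
        return chars

    def encode_chars(chars):
        """Invert decode_chars, byte for byte."""
        out = bytearray()
        for c in chars:
            if c >= RAW:
                out.append(c - RAW)     # escaped byte returns verbatim
            else:
                out.extend(chr(c).encode("utf-8"))
        return bytes(out)

    raw = b"ok\xe2\x82"    # ends inside a split three-byte character
    assert encode_chars(decode_chars(raw)) == raw   # lossless round-trip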
> Look at the problem of concatenation. We could have a case where two
> illegal UTF-8 (or UTF-16) snippets are concatenated to get valid UTF-8
> (or UTF-16).
Not possible in the internal Emacs representation: its character
boundaries are established on decoding. Only after reencoding such a
concatenation may valid UTF-8 fall out. If you transfer text in blocks
of a given byte size, this can happen. But concatenating in the decoded
character domain will not magically change the character count: you'll
retain two (or more) pieces of a split UTF-8 character. If you want to
do processing on concatenated blocks, you had better decode only after
concatenation. Or reencode and redecode, but that seems wasteful. It
would work, though.
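The point is easy to demonstrate with Python's surrogateescape, where
the escaped pieces behave the same way as in the toy scheme above:

    left, right = b"\xe2\x82", b"\xac"   # '€' split across two blocks

    # Decode first, then concatenate: still three escaped bytes, no '€'.
    a = left.decode("utf-8", "surrogateescape")
    b = right.decode("utf-8", "surrogateescape")
    print(ascii(a + b))                  # '\udce2\udc82\udcac'

    # Concatenate first, then decode: the character falls out.
    print((left + right).decode("utf-8"))             # '€'

    # Reencode and redecode also recovers it: wasteful, but it works.
    rejoined = (a + b).encode("utf-8", "surrogateescape").decode("utf-8")
    print(rejoined)                                   # '€'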
> That operation fails if you try to translate the snippets to strings
> before concatenation. Such concatenation operations are commonplace
> when dealing with filenames (eg, split(1)).
split(1) does not "deal with filenames" when splitting, but the
individual files it produces may well be cut inside a UTF-8 sequence.
See above.