Marko Rauhamaa <ma...@pacujo.net> writes:

> Eli Zaretskii <e...@gnu.org>:
>
>> Btw, if by "UCS-2" you meant to say that only characters within the
>> BMP are supported in file names on Windows, then this is wrong
>
> No, I'm claiming Windows allows pathnames to contain isolated surrogate
> code points, which cannot be decoded back to Unicode with UTF-16.
>
> The situation is completely analogous to Linux pathnames that can
> contain illegal UTF-8.
>
>> : since Windows XP, NTFS volumes support file names with characters
>> outside of the BMP. I've just successfully created files with such
>> file names on Windows XP using Emacs.
>
> Both Windows and Linux filenames support all of Unicode. Trouble is,
> both of them support more than Unicode, making it impossible to use
> Guile's strings for an arbitrary filename.
>
> Python solves the problem by using a Unicode superset in its strings. I
> think that's misguided, and Guile is correct in sticking to Unicode.
>
> If I understood it correctly, someone just told us emacs maps illegal
> UTF-8 to another form of illegal UTF-8 and back. That's better in that
> it's bytes to bytes (leaving Unicode out), but it's not immediately
> obvious to me why you have to transform the byte sequence at all.

After the transformation, the resulting string satisfies the structural
invariants of the UTF-8 encoding scheme: each character is either a
single byte in the range 0x00 to 0x7f, or a lead byte with its n+1 high
bits set (n >= 1) followed by n bytes in the range 0x80 to 0xbf.
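
To make that invariant concrete, here is a minimal Python sketch that
splits a byte string into characters using nothing but the general
coding scheme.  The names leading_ones and split_characters are made up
for illustration, and this is not Emacs code.  It checks structure only
(no overlong-form or code point range checks, and no upper limit on
sequence length), so it accepts more than strictly valid UTF-8.

```python
def leading_ones(byte: int) -> int:
    """Count the consecutive 1 bits at the high end of an 8-bit value."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def split_characters(data: bytes) -> list:
    """Split data into per-character byte sequences.

    Raises ValueError if the structural invariant is violated.
    Hypothetical helper, structural check only.
    """
    chars = []
    i = 0
    while i < len(data):
        ones = leading_ones(data[i])
        if ones == 0:
            length = 1      # single byte, 0x00 to 0x7f
        elif ones == 1:
            raise ValueError("stray continuation byte at offset %d" % i)
        else:
            length = ones   # lead byte (n+1 high bits) plus n followers
        chunk = data[i:i + length]
        followers_ok = all(0x80 <= b <= 0xBF for b in chunk[1:])
        if len(chunk) != length or not followers_ok:
            raise ValueError("malformed sequence at offset %d" % i)
        chars.append(chunk)
        i += length
    return chars
```

Any operation that wants characters rather than bytes can walk a string
this way without knowing whether a given sequence stands for a valid
Unicode character or for an escaped raw byte.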

All string operators are able to access the individual _characters_ of
such strings (rather than bytes) by working with the general coding
scheme.  In addition, valid UTF-8 remains valid UTF-8 and all string
search/regexp operations work fine on it.  In fact, string search/regexp
operations even work with Emacs strings representing _invalid_ UTF-8
sequences.

The purpose of the transform is to _not_ have a flavorless chunk of
bytes but rather something organized into characters.  Of course this
only makes sense when _typically_ those characters correspond to valid
UTF-8.  If you know your string to be UTF-16, a different reencoding
would make sense, but preferably also one preserving both valid and
invalid content.
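
As a rough illustration of what such a content-preserving decoding could
look like for UTF-16, the following Python sketch uses Python's own
"surrogatepass" error handler as a stand-in.  That is Python's
mechanism, not anything Emacs or Guile does, and the sample byte string
is invented for the example.

```python
# 'a', an isolated high surrogate (U+D800), then 'b', all in UTF-16-LE.
# An isolated surrogate is not valid UTF-16.
raw = b"a\x00" + b"\x00\xd8" + b"b\x00"

# A strict decoder rejects the input outright.
try:
    raw.decode("utf-16-le")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogatepass" keeps the isolated surrogate as one character, so the
# string is organized into three characters and round-trips exactly.
text = raw.decode("utf-16-le", "surrogatepass")
roundtrip = text.encode("utf-16-le", "surrogatepass")

# Valid content, including characters outside the BMP, is unaffected.
valid = "\U0001F600".encode("utf-16-le")
valid_text = valid.decode("utf-16-le", "surrogatepass")
```

The point is only that both valid and invalid input survive a
decode/encode round trip while the decoded form is still a sequence of
characters.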

> Look at the problem of concatenation. We could have a case where two
> illegal UTF-8 (or UTF-16) snippets are concatenated to get valid UTF-8
> (or UTF-16).

Not possible in the internal Emacs representation.  Its character
boundaries are established on decoding.  Only after reencoding such a
concatenation can valid UTF-8 fall out.  This may happen if you transfer
text in blocks of a given byte size.  But concatenating in the decoded
character domain will not magically change the character count: you'll
retain two (or more) pieces of a split UTF-8 character.  If you want to
process concatenated blocks, you had better decode only after
concatenation.  Reencoding and redecoding would also work, but seems
wasteful.
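
The same effect can be reproduced with Python's "surrogateescape" error
handler, used here purely as an analogy for the behaviour described
above (the Emacs internal representation is different).  The decode and
encode helpers are hypothetical names for this sketch.

```python
def decode(block: bytes) -> str:
    # Undecodable bytes become escape characters, one per raw byte.
    return block.decode("utf-8", "surrogateescape")


def encode(text: str) -> bytes:
    # Escape characters are turned back into the original raw bytes.
    return text.encode("utf-8", "surrogateescape")


data = "\u00e9".encode("utf-8")      # e-acute: the two bytes c3 a9
first, second = data[:1], data[1:]   # block boundary splits the character

# Decode each block, then concatenate: character boundaries were fixed
# at decoding time, so the result is two escaped pieces, not e-acute.
joined = decode(first) + decode(second)
assert len(joined) == 2
assert joined != "\u00e9"

# Only reencoding makes valid UTF-8 fall out again.
assert encode(joined) == data
assert decode(encode(joined)) == "\u00e9"

# Concatenating first and decoding afterwards avoids the detour.
assert decode(first + second) == "\u00e9"
```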

> That operation fails if you try to translate the snippets to strings
> before concatenation. Such concatenation operations are commonplace
> when dealing with filenames (eg, split(1)).

split(1) does not "deal with filenames" when splitting; it is the file
contents that may be split in the middle of UTF-8 sequences.  See above.

-- 
David Kastrup

