I've been following this thread for a while, and I've pretty much got the hang of the issues here. To summarize:

Unix filenames consist of an arbitrary sequence of octets, excluding 0x00 and 0x2F. How they are /displayed/ to any given user depends on that user's locale setting. In this scenario, two users with different locale settings will see different filenames for the same file, but they will still be able to access the file via the filename that they see. These two filenames will be spelt identically in terms of octets, but (apparently) differently when viewed in terms of characters.

At least, that's how it was until the UTF-8 locale came along. If we consider only one-byte-per-character encodings, then any octet sequence is "valid" in any locale. But UTF-8 introduces the possibility that an octet sequence might be "invalid" - a new concept for Unix. So if you change your locale to UTF-8, then suddenly, some files created by other users might appear to you to have invalid filenames (though they would still appear valid when viewed by the file's creator).

A specific example: if a file F is accessed by two different users, A and B, of whom A has set their locale to Latin-1, and B has set their locale to UTF-8, then the filename may appear to be valid to user A, but invalid to user B.
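If it helps to see this concretely, here is that situation as a small Python sketch (the filename is an invented example, and Python's decode machinery stands in for the locale-dependent display step):

```python
# The same octet sequence on disk, viewed through two different locales.
# (Hypothetical filename, purely for illustration.)
name = "caf\u00e9".encode("latin-1")      # b'caf\xe9' - four octets

print(name.decode("latin-1"))             # user A (Latin-1 locale) sees: café

try:
    name.decode("utf-8")                  # user B (UTF-8 locale)
except UnicodeDecodeError as e:
    print("invalid for user B:", e)       # 0xE9 starts no valid UTF-8 sequence here
```

The octets never change; only the interpretation does, and one of the two interpretations simply has no answer.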

Lars is saying (and he's probably right, because he knows more about Unix than I) that user B does not necessarily have the right to change the actual octet sequence which is the filename of F, just to make it appear valid to user B, because doing so would stop a lot of things working for user A (for instance, A might have created the file, the filename might be hardcoded in a script, etc.). So Lars takes a Unix-like approach, saying "retain the actual octet sequence, but feel free to try to display and manipulate it as if it were some UTF-8-like encoding in which all octet sequences are valid". And all this seems to work fine for him, until he tries to roundtrip to UTF-16 and back.

I'm not sure why anyone's arguing about this though - Phillipe's suggestion seems to be the perfect solution which keeps everyone happy. So...

...allow me to construct a specific version of what Phillipe suggested only in general terms:

DEFINITION - "NOT-Unicode" is the character repertoire consisting of the whole of Unicode, and 128 additional characters representing integers in the range 0x80 to 0xFF.

OBSERVATION - Unicode is a subset of NOT-Unicode

DEFINITION - "NOT-UTF-8" is a bidirectional encoding between a NOT-Unicode character stream and an octet stream, defined as follows: if a NOT-Unicode character is a Unicode character, then its encoding is the UTF-8 encoding of that character; else the NOT-Unicode character must represent an integer, in which case its encoding is the single octet with that value. To decode, assume the next NOT-Unicode character is a Unicode character and attempt to decode one character from the octet stream using UTF-8; if this fails, then the NOT-Unicode character is an integer: read a single octet from the stream and return the character representing that integer.
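A Python sketch of this, under a modelling assumption of my own (not anything standard): a NOT-Unicode character is either a one-character str (a Unicode character) or a plain int in 0x80..0xFF (one of the 128 extra characters).

```python
def not_utf8_encode(chars):
    out = bytearray()
    for c in chars:
        if isinstance(c, str):
            out += c.encode("utf-8")   # Unicode character: ordinary UTF-8
        else:
            out.append(c)              # integer character: the octet itself
    return bytes(out)

def not_utf8_decode(data):
    chars = []
    i = 0
    while i < len(data):
        decoded = None
        # Try to read one UTF-8-encoded character (1 to 4 octets) at position i.
        for n in (1, 2, 3, 4):
            try:
                decoded = data[i:i + n].decode("utf-8")
                i += n
                break
            except UnicodeDecodeError:
                continue
        if decoded is None:
            chars.append(data[i])      # not valid UTF-8: one octet, as an integer
            i += 1
        else:
            chars.append(decoded)
    return chars
```

Every octet sequence decodes (invalid octets simply come out as integer characters), and re-encoding reproduces the original octets exactly.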

OBSERVATION - All possible octet sequences are valid NOT-UTF-8.

OBSERVATION - NOT-Unicode characters which are Unicode characters will be encoded identically in UTF-8 and NOT-UTF-8

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot be represented in UTF-8

DEFINITION - "NOT-UTF-16" is a bidirectional encoding between a NOT-Unicode character stream and a 16-bit word stream, defined as follows: if a NOT-Unicode character is a Unicode character then its encoding is the UTF-16 encoding of that character; else the NOT-Unicode character must represent an integer, in which case its encoding is 0xDC00 plus the integer. To decode, if the next 16-bit word is in the range 0xDC80 to 0xDCFF then the NOT-Unicode character is the integer whose value is (word16 - 0xDC00), else the NOT-Unicode character is the Unicode character obtained by decoding as if UTF-16.
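A Python sketch again, restating the same modelling assumption so it stands alone: a NOT-Unicode character is a one-character str (Unicode) or an int in 0x80..0xFF, and the 16-bit word stream is a list of ints.

```python
def not_utf16_encode(chars):
    words = []
    for c in chars:
        if isinstance(c, int):
            words.append(0xDC00 + c)            # integer -> 0xDC80..0xDCFF
        elif ord(c) < 0x10000:
            words.append(ord(c))                # BMP character: one word
        else:
            cp = ord(c) - 0x10000               # supplementary: surrogate pair
            words.append(0xD800 + (cp >> 10))
            words.append(0xDC00 + (cp & 0x3FF))
    return words

def not_utf16_decode(words):
    chars = []
    i = 0
    while i < len(words):
        w = words[i]
        if 0xDC80 <= w <= 0xDCFF:
            chars.append(w - 0xDC00)            # escaped integer
            i += 1
        elif 0xD800 <= w <= 0xDBFF:             # high surrogate: consume the pair
            lo = words[i + 1]
            chars.append(chr(0x10000 + ((w - 0xD800) << 10) + (lo - 0xDC00)))
            i += 2
        else:
            chars.append(chr(w))                # ordinary BMP character
    return chars
```

Note there is no ambiguity with genuine surrogate pairs: a pair always begins with a high surrogate (0xD800..0xDBFF), which is tested and consumed before its low word is ever looked at in isolation.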

OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 -> NOT-UTF-16 -> NOT-UTF-8

OBSERVATION - NOT-Unicode characters which are Unicode characters will be encoded identically in UTF-16 and NOT-UTF-16

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot be represented in UTF-16

DEFINITION - "NOT-UTF-32" is a bidirectional encoding between a NOT-Unicode character stream and a 32-bit word stream, defined as follows: if a NOT-Unicode character is a Unicode character then its encoding is the UTF-32 encoding of that character; else the NOT-Unicode character must represent an integer, in which case its encoding is 0x0000DC00 plus the integer. To decode, if the next 32-bit word is in the range 0x0000DC80 to 0x0000DCFF then the NOT-Unicode character is the integer whose value is (word32 - 0x0000DC00), else the NOT-Unicode character is the Unicode character obtained by decoding as if UTF-32.
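This one is the simplest of the three, since UTF-32 has no multi-unit sequences. The same Python modelling assumption as before (one-character str or int in 0x80..0xFF; the word stream is a list of ints):

```python
def not_utf32_encode(chars):
    # Unicode character -> its code point; escaped integer -> 0x0000DC00 + integer
    return [ord(c) if isinstance(c, str) else 0xDC00 + c for c in chars]

def not_utf32_decode(words):
    # 0x0000DC80..0x0000DCFF -> the escaped integer; anything else -> the character
    return [w - 0xDC00 if 0xDC80 <= w <= 0xDCFF else chr(w) for w in words]
```

The range test is safe because 0xDC80..0xDCFF are surrogate code points, which are never the UTF-32 encoding of any Unicode character.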

OBSERVATION - Roundtripping is possible in the directions NOT-UTF-8 -> NOT-UTF-32 -> NOT-UTF-8 and NOT-UTF-16 -> NOT-UTF-32 -> NOT-UTF-16

OBSERVATION - NOT-Unicode characters which are Unicode characters will be encoded identically in UTF-32 and NOT-UTF-32

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot be represented in UTF-32

This would appear to solve Lars' problem, and because the three encodings, NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be UTFs, no-one need get upset.

I /think/ that will work.
Jill