Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

Kenneth Whistler wrote:
> Further, as it turns out that Lars is actually asking for
> "standardizing" corrupt UTF-8, a notion that isn't going to
> fly even two feet, I think the whole idea is going to be
> a complete non-starter.

Technically, I am not asking for anything. I am simply trying to discuss an approach that I think can be used to solve certain problems. This approach does not need to be conformant at this point. If someone finds it worthwhile to make it conformant, even better, but for now that is irrelevant to the discussion, unless it is proven that it cannot be made conformant (by changing or amending the standard) because I have missed an important fact. So far, I have seen no such proof.


But suppose I am asking, and therefore proposing. It would be several separate items:

1 - To assign codepoints for 128 (or 256) new surrogates(*), used for:
1.1 - Representing unassigned values when converting from an encoding to Unicode (optional).
1.2 - Representing invalid sequences when interpreting UTF-8 (optional).
The use of these codepoints would not be mandatory. Existing handling remains an option and can be preserved wherever it suits the needs, or changed where the new behavior is beneficial.

Representation of these codepoints in UTF-8 would be as per current standard.
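To make item 1.2 concrete, here is a minimal sketch in Python. It is not part of the proposal; the BASE value is purely illustrative (no block is actually assigned for this purpose), and I lean on Python's existing 'surrogateescape' error handler, which implements the same byte-escaping idea using the lone surrogates U+DC80..U+DCFF, then remap onto the hypothetical block:

```python
BASE = 0xA400  # hypothetical escape block; illustrative only

def decode_with_escapes(data: bytes) -> str:
    """Decode UTF-8, mapping each byte of an invalid sequence to one
    of 128 codepoints in the hypothetical BASE block."""
    # 'surrogateescape' maps each undecodable byte 0x80..0xFF to
    # U+DC80..U+DCFF; remap those onto BASE..BASE+0x7F.
    s = data.decode('utf-8', errors='surrogateescape')
    return ''.join(
        chr(BASE + (ord(c) - 0xDC80)) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in s
    )
```

For example, decode_with_escapes(b'abc\xffdef') yields 'abc\ua47fdef': the stray 0xFF byte survives as a single escape codepoint instead of being replaced or rejected. Only 128 codepoints are needed because bytes below 0x80 can never be part of an invalid UTF-8 sequence.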


2 - An alternative conversion from Unicode, to, say, UTF-8E (UTF-8E is _NOT_ Unicode(*)).
This conversion would reconstruct the original byte sequence from a Unicode string obtained by 1.2. The conversion pair is intended for use on platform or interface boundaries, if and where it is determined to be suitable. For example, interfacing a UNIX filesystem with a UTF-8 pipe would require UTF-8E<=>UTF-8 conversion; interfacing a UNIX filesystem with a Windows filesystem would require UTF-8E<=>UTF-16 conversion.
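The reverse direction can be sketched just as briefly. Again, BASE is my own illustrative choice, not an assigned block, and the round trip assumes that legitimate text never itself contains codepoints from the escape block:

```python
BASE = 0xA400  # the same hypothetical escape block as in item 1

def encode_utf8e(s: str) -> bytes:
    """Sketch of the UTF-8E direction: escape codepoints become their
    original invalid bytes; everything else is ordinary UTF-8."""
    out = bytearray()
    for c in s:
        cp = ord(c)
        if BASE <= cp < BASE + 0x80:
            out.append(0x80 + (cp - BASE))  # restore the original invalid byte
        else:
            out += c.encode('utf-8')
    return bytes(out)
```

Composing the two conversions reproduces the original byte sequence exactly, which is the whole point: a UNIX filename that is not valid UTF-8 can pass through a Unicode string and come back unchanged.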


(*) If proposal #2 is not accepted, then the codepoints in proposal #1 would not actually be surrogates, but simply codepoints and nothing else. Even if proposal #2 is accepted, it is still not clear whether they should really be called surrogates, since they would convert among all UTFs just like any other codepoint; only their representation in UTF-8E would differ. Note that UTF-8E is not Unicode, but it would be standardized within Unicode. If the U in UTF is a problem, any other name can be chosen; consider it a working name, and be aware of what it is and is not.


3 - If the UTC cannot agree that the BMP should be used for proposal #1, I would advise against a decision to assign non-BMP codepoints for the purpose. I believe less damage would be done by postponing the decision than by making a wrong one. It is not just about how much disk space or bandwidth is used. For example, if both filesystems have a 256-character limit on filenames, the limitations are consistent (at least in one direction) if the BMP is used, but not if any other plane is used.


4 - If neither proposal is accepted, it would be beneficial if the UTC would manage to preserve at least one suitable block of 256 codepoints (for example U+A4xx or U+ABxx) intact, to facilitate a future decision.


Lars Kristan
