RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

Lars Kristan Sat, 11 Dec 2004 04:06:12 -0800


John Cowan wrote:
> However, although they are *technically* octet sequences, they
> are *functionally* character strings. That's the issue.
Nicely put! But UTC does not seem to care.

>
> > The point I'm making is that *whatever* you do, you are still
> > asking for implementers to obey some convention on conversion
> > failures for corrupt, uninterpretable character data.
> > My assessment is that you'd have no better success at making
> > this work universally well with some set of 128 magic bullet
> > corruption pills on Plane 14 than you have with the
> > existing Quoted-Unprintable as a convention.
>
> It doesn't have to work universally; indeed, it becomes a QOI issue.
> Allocating representations of bytes with "bits that are high" makes
> it possible to do something recoverable, at very little expense to the
> Unicode Consortium.
Except that the expense should be slightly higher. The importance of these replacement codepoints is still underestimated: they belong in the BMP. And at least nobody can blame UTC for cultural bias in this case, since these codepoints are universal.
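
To make the idea concrete, here is a minimal sketch of what "representations of bytes with high bits" could look like in practice. It is only an illustration under stated assumptions: no such codepoints have been allocated, so ESCAPE_BASE below is a hypothetical placeholder taken from the Private Use Area, and the function names are made up for this example.

```python
# Hypothetical base of a 128-codepoint block standing in for raw bytes
# 0x80..0xFF. This is a placeholder in the Private Use Area, not a real
# allocation.
ESCAPE_BASE = 0xEE80


def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte to an escape codepoint."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into data[pos:].
            out.append(data[pos:pos + err.start].decode("utf-8"))
            # Bytes rejected by the decoder are always in 0x80..0xFF.
            for b in data[pos + err.start:pos + err.end]:
                out.append(chr(ESCAPE_BASE + (b - 0x80)))
            pos += err.end
    return "".join(out)


def encode_lossless(text: str) -> bytes:
    """Encode to UTF-8, turning escape codepoints back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE <= cp < ESCAPE_BASE + 0x80:
            out.append(cp - ESCAPE_BASE + 0x80)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"abc\xff\xc3\xa9\x80"          # valid text mixed with stray bytes
text = decode_lossless(raw)
roundtrip = encode_lossless(text)     # the stray bytes survive the trip
```

Note the limitation of this sketch: input that already contains the escape codepoints in validly encoded form does not round-trip byte for byte, because they come back out as single raw bytes. Any real convention would have to decide how to treat that case.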

>
> > Further, as it turns out that Lars is actually asking for
> > "standardizing" corrupt UTF-8, a notion that isn't going to
> > fly even two feet, I think the whole idea is going to be
> > a complete non-starter.
>
> I agree that that part won't fly, absolutely.
Then I'll have to restructure it.

Lars
