Re: Substituting malformed UTF-8 sequences in a decoder

Markus Kuhn Sun, 23 Jul 2000 16:33:13 -0700
Edmund GRIMLEY EVANS wrote on 2000-07-23 21:44 UTC:
> Markus Kuhn <[EMAIL PROTECTED]>:
> 
> > A) Emit a single U+FFFD per malformed sequence
> 
> We discussed this before. I can think of several ways of interpreting
> the phrase "malformed sequence".

ISO 10646-1 section R.7 is quite clear here.

> I think you probably mean either a single octet in the range 80..BF or
> a single octet in the range FE..FF

Yes.

> or an octet in the range C0..FD
> followed by any number of octets in the range 80..BF such that it
> isn't correct UTF-8 and isn't followed by another octet in the range
> 80..BF.

No. An octet in the range C0..FD followed by fewer octets in the range
80..BF than its leading 1 bits call for (that is, fewer than the number
of leading 1 bits minus one) is one single malformed sequence
(obviously trailing bytes are missing). An octet in the range C0..FD
followed by more octets in the range 80..BF than that is a *correct*
UTF-8 sequence (except if it encodes a character for which there would
be a shorter code, or a surrogate), followed by a single malformed
sequence for each additional 80..BF byte.
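
A minimal sketch of semantics A as read from the rules above, in Python.
The names decode_a, sequence_length and MIN_FOR_LENGTH are hypothetical.
It assumes the original six-byte form of UTF-8, assumes that an overlong
or surrogate encoding counts as one malformed sequence, and replaces
values above U+10FFFF only because a Python str cannot hold them.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that genuinely needs a sequence of each length
# (six-byte UTF-8; an assumption matching the era of this discussion).
MIN_FOR_LENGTH = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def sequence_length(lead):
    """Total length announced by a lead byte C0..FD (its leading 1 bits)."""
    n = 0
    mask = 0x80
    while lead & mask:
        n += 1
        mask >>= 1
    return n


def decode_a(data):
    """Decode bytes, emitting one U+FFFD per malformed sequence."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        if b < 0xC0 or b > 0xFD:          # stray 80..BF, or FE/FF
            out.append(REPLACEMENT)
            i += 1
            continue
        length = sequence_length(b)
        value = b & (0x7F >> length)      # payload bits of the lead byte
        j = i + 1
        # Consume at most length - 1 continuation bytes, never more.
        while j < i + length and j < len(data) and 0x80 <= data[j] <= 0xBF:
            value = (value << 6) | (data[j] & 0x3F)
            j += 1
        complete = (j - i == length)
        ok = (complete
              and value >= MIN_FOR_LENGTH[length]      # not overlong
              and not 0xD800 <= value <= 0xDFFF        # not a surrogate
              and value <= 0x10FFFF)                   # Python str limit
        out.append(chr(value) if ok else REPLACEMENT)
        i = j
    return "".join(out)
```

With this sketch a truncated three-byte sequence such as E2 82 yields one
U+FFFD, while C3 A4 80 80 yields the character U+00E4 followed by two
U+FFFD, one for each surplus 80..BF byte.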

Note that UTF-8 is self-terminating. This means that you cannot make a
valid UTF-8 sequence invalid by appending any additional bytes, and your
decoder must respect this. If I read what you suggested correctly, then
you mean that appending 80 to a valid UTF-8 sequence will make it
invalid, which clearly is not sensible semantics for a decoder, as it
does not honor the self-terminating property of UTF-8.
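
The self-terminating property can be checked against an existing
decoder. The sketch below uses Python's built-in UTF-8 codec with
errors="replace"; keeps_valid_prefix is a hypothetical helper name.
Python follows the present-day four-byte definition of UTF-8, so this
only illustrates the property and is not a model of any particular
decoder discussed here.

```python
def keeps_valid_prefix(valid: bytes, extra: bytes) -> bool:
    """True if appending `extra` leaves the decoding of `valid` intact."""
    prefix = valid.decode("utf-8")   # raises if `valid` is not valid UTF-8
    whole = (valid + extra).decode("utf-8", errors="replace")
    return whole.startswith(prefix)


# Appending a lone 80 adds one U+FFFD after the character; it does not
# turn the preceding valid sequence C3 A4 into an error.
decoded = (b"\xc3\xa4" + b"\x80").decode("utf-8", errors="replace")
```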

> This is probably quite hard to implement consistently, and, as with
> semantics C, the UTF-8/UTF-16 length ratio is unbounded, which means
> in particular that you can't decode from a fixed-size buffer in the
> manner of mbrtowc.

No, it is not. A malformed sequence can never be longer than the longest
correct sequence, namely 6 bytes.
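
To illustrate the bound, the following Python sketch measures how many
bytes one decoding step consumes under the reading given earlier in this
message. The function name unit_span is hypothetical, and six-byte UTF-8
is assumed. Because a lead byte never claims more continuation bytes
than it announces, no step consumes more than 6 bytes, whether the
result is a character or a single U+FFFD.

```python
def unit_span(data: bytes, i: int) -> int:
    """Bytes consumed by the valid or malformed sequence starting at i."""
    lead = data[i]
    if lead < 0xC0 or lead > 0xFD:
        return 1                      # ASCII, stray 80..BF, or FE/FF
    # Announced length = number of leading 1 bits, between 2 and 6.
    length = 2
    while length < 6 and lead & (0x80 >> length):
        length += 1
    # Take continuation bytes, but never more than the lead byte announces.
    n = 1
    while n < length and i + n < len(data) and 0x80 <= data[i + n] <= 0xBF:
        n += 1
    return n


# Worst case over every possible first byte, each followed by a long run
# of continuation bytes.
worst = max(unit_span(bytes([b]) + b"\x80" * 20, 0) for b in range(256))
```

A fixed buffer of 6 bytes is therefore always enough to produce the next
output unit, which is what an mbrtowc-style interface needs.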

[Option B)]
> But you have to ask yourself: do I reset the mbstate_t when I replace
> a bad byte by U+FFFD? If you want consistency, you probably should, as
> otherwise the mbstate_t is undefined after mbrtowc gives EILSEQ.

Of course.

> > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> >    UTF-8 sequence
>
> Not much good if you're not converting to UTF-16.

No. Note that you do not have to actually convert to UTF-16 to make use
of this technique. The exact same trick also works with UCS-2, UCS-4,
etc.! It is just more educational to explain it in terms of UTF-16,
because then it becomes very clear why mapping bytes of malformed
sequences onto U+DC80 .. U+DCFF is a particularly good choice of error
codes, since it does not collide with anything that even a UTF-8 ->
UTF-16 decoder could produce normally.

The entire technique itself is completely independent of UTF-16!
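
The same byte-to-U+DC80..U+DCFF mapping exists today as Python's
"surrogateescape" error handler, which makes for a short illustration.
The sketch below is not the decoder proposed in this thread: Python
applies the present-day four-byte definition of UTF-8, and the sample
input in raw is made up. Python strings are not UTF-16, so it also shows
that no UTF-16 conversion is involved.

```python
# Valid text, then a truncated three-byte sequence (E2 82), then FF.
raw = b"ab\xc3\xa4\xe2\x82\xff"

# Each byte of a malformed sequence becomes U+DC00 + byte value,
# i.e. a code point in U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# The mapping is reversible, so the original bytes can be recovered.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict encoder refuses these code points, which is why they cannot
# be confused with anything a decoder produces from well-formed input.
try:
    text.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False
```

Unlike a U+FFFD substitution, this keeps every malformed byte
distinguishable, so the input can be reproduced exactly.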

> So perhaps B should be the generally recommended way.

What's wrong with D)? Think about it again and don't get confused just
because I mentioned UTF-16 ...

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/