Re: Substituting malformed UTF-8 sequences in a decoder

Markus Kuhn Mon, 24 Jul 2000 03:45:24 -0700
Edmund GRIMLEY EVANS wrote on 2000-07-24 09:55 UTC:
> So C0 80 is one malformed sequence, and F0 80 80 80 80 is two
> malformed sequences, yes?

Yes. That was what I meant in option A).

> But I still can't use library functions to do this, and I don't see
> the advantage over plain and simple B.

As I said, it was my interpretation of the words of ISO, and it happened
to be slightly easier to implement in the specific situation of the
xterm decoder. I do agree that in general, options B and D (which are
compatible with respect to the number of characters decoded!) are
probably what is going to win in the end.

> > > > D) Emit a malformed UTF-16 sequence for every byte in a malformed
> > > >    UTF-8 sequence
> > >
> > > Not much good if you're not converting to UTF-16.
> > 
> > No. Note that you do not have to actually convert to UTF-16 to make use
> > of this technique. The exact same trick works also with UCS-2, UCS-4,
> > etc.! It is just more educational to explain it in terms of UTF-16,
> > because then it becomes very clear why mapping bytes of malformed
> > sequences onto U+DC80 .. U+DCFF is a particularly good choice of error
> > codes, since it does not collide with anything that even a UTF-8 ->
> > UTF-16 decoder could produce normally.
> 
> But aren't ED B2 80 and 80 both mapped to U+DC80, which are both then
> mapped back to 80, changing correct UTF-8 into malformed UTF-8?

No. Remember that I wrote also:

> > If we simply make sure that every UTF-8 encoded surrogate
> > character is also treated like a malformed sequence, then there is no
> > way that a single high-half surrogate could precede the encoded
> > malformed sequence and cause a valid UTF-16 sequence to emerge.

Consequently, ED B2 80 must be treated as a malformed UTF-8 sequence,
which in my proposal means that it will be mapped to U+DCED U+DCB2
U+DC80, while a single 80 will just be mapped to U+DC80. The scheme as I
described it is 100% binary transparent. Every byte sequence survives
the UTF-8 -> UTF-16 -> UTF-8 round trip without any loss of
information.
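
To make the round trip concrete, here is a minimal sketch in Python. It
is not the xterm code. It assumes Python 3's built-in "surrogateescape"
error handler, which happens to use the same mapping (each undecodable
byte 0x80 .. 0xFF becomes U+DC80 .. U+DCFF) and whose UTF-8 decoder
already treats encoded surrogates as malformed. The two helper names are
made up for this illustration.

```python
def decode_transparent(data: bytes) -> str:
    """Decode UTF-8; every byte of a malformed sequence becomes U+DC00 + byte.

    Assumption: relies on Python's 'surrogateescape' handler rather than on
    a hand-written decoder.
    """
    return data.decode("utf-8", errors="surrogateescape")


def encode_transparent(text: str) -> bytes:
    """Inverse of decode_transparent: U+DC80 .. U+DCFF turn back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


samples = [
    b"\x80",                  # lone continuation byte
    b"\xed\xb2\x80",          # UTF-8 encoded surrogate U+DC80, treated as malformed
    b"\xc0\x80",              # overlong encoding of U+0000
    b"\xf0\x80\x80\x80\x80",  # overlong four-byte form plus a stray byte
    "Gr\u00fc\u00dfe".encode("utf-8"),  # ordinary well-formed text
]

for raw in samples:
    decoded = decode_transparent(raw)
    # The original bytes come back unchanged, whether or not they were valid.
    assert encode_transparent(decoded) == raw
```

Because ED B2 80 decodes to three error codes and a single 80 decodes to
one, the two inputs stay distinguishable after decoding, which is what
keeps the scheme transparent.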

ISO 10646-1 sanctions such a technique in principle, because section
R.4 at least says:

  NOTE 3 - Values of x in the range 0000D800 .. 0000DFFF are reserved for the
  UTF-16 form and do not occur in UCS-4. The values 0000FFFE and 0000FFFF also
  do not occur (see clause 8). The mappings of these code positions in UTF-8 are
  undefined.

So all UTF-8 sequences that represent any of these UCS-4 values can and
should be treated like malformed sequences. There are far more potential
security loopholes to be suppressed here. For example, UTF-8 encoded
U+FFFE must be suppressed in a decoder, because it could be used by an
attacker to create havoc with a false BOM in some applications, and
U+FFFF similarly might have special applications.
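
The check implied by the note above can be sketched roughly as follows,
again in Python. It only covers three-byte sequences, since that is
where the surrogates, U+FFFE and U+FFFF all live, and it does not test
for overlong forms. The function names are hypothetical and not taken
from any existing decoder.

```python
def decode_three_byte(seq: bytes) -> int:
    """Return the UCS value of a structurally valid three-byte UTF-8 sequence."""
    if len(seq) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = seq
    if not (0xE0 <= b0 <= 0xEF and 0x80 <= b1 <= 0xBF and 0x80 <= b2 <= 0xBF):
        raise ValueError("not a three-byte UTF-8 sequence")
    # 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_undefined_in_utf8(value: int) -> bool:
    """True for the values whose UTF-8 mapping NOTE 3 leaves undefined."""
    return 0xD800 <= value <= 0xDFFF or value in (0xFFFE, 0xFFFF)


def must_treat_as_malformed(seq: bytes) -> bool:
    """True if a three-byte sequence encodes a surrogate, U+FFFE or U+FFFF."""
    return is_undefined_in_utf8(decode_three_byte(seq))


assert must_treat_as_malformed(b"\xed\xb2\x80")      # U+DC80, a low surrogate
assert must_treat_as_malformed(b"\xef\xbf\xbe")      # U+FFFE, the false BOM
assert must_treat_as_malformed(b"\xef\xbf\xbf")      # U+FFFF
assert not must_treat_as_malformed(b"\xe2\x82\xac")  # U+20AC, EURO SIGN
```

A decoder following option D would then hand each byte of a rejected
sequence to the same U+DC80 .. U+DCFF substitution as any other
malformed input.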

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/