Re: utf-8 encoding scheme

H. Peter Anvin Fri, 21 Jul 2000 21:16:22 -0700
Followup to:  <[EMAIL PROTECTED]>
By author:    Henry Spencer <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> 
> > This is incredibly important, since any misguided
> > attempt to "be liberal in what you accept" without addition of an
> > explicit canonicalization step would lead to the kind of security
> > holes that Microsoft web-related applications have been so full of...
> 
> There is somewhat of a contradiction here:  the usual place to "be liberal
> in what you accept" is a decoder, which has an inherent canonicalization
> step.  In practice, I think the only place where you make a meaningful
> *choice* as to whether to be liberal or not is when canonicalization is
> imminent anyway.  Someone who is comparing the raw UTF-8 sequences is
> making a mistake, yes, but it's not a "be liberal in what you accept"
> mistake, it's an error in choice of working representation. 
> 

It's not such a contradiction, actually.  If your decoder does the
canonicalization, you absolutely must guarantee that the stream of
bits has not been in any way, shape, or form analyzed or filtered
prior to decoding.  This is usually an extremely difficult thing to
guarantee; in fact, UTF-8 is designed exactly so that common
text string operations such as (sub)string matching and sorting in
encoding order are still valid.  These guarantees *require* that
illegal encodings be rejected.  Thus, unless you are willing to do
some *serious* security analysis, being liberal in what you accept is
probably a very bad idea in this particular context.
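
To make the failure mode concrete, here is a minimal sketch in Python.
The names lax_decode, byte_filter_allows and PAYLOAD are hypothetical,
made up for this illustration; the "liberal" decoder handles only one-
and two-byte sequences and is deliberately wrong in that it accepts
overlong forms.  The filter looks at the raw bytes before decoding,
which is exactly the situation described above.

```python
def lax_decode(data: bytes) -> str:
    """Hypothetical 'liberal' decoder (illustration only, do not use).

    Handles 1- and 2-byte sequences and does NOT reject overlong forms,
    so it silently canonicalizes alternate spellings of ASCII characters.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif (0xC0 <= b <= 0xDF and i + 1 < len(data)
              and (data[i + 1] & 0xC0) == 0x80):
            # 2-byte sequence; a correct decoder would reject lead
            # bytes 0xC0 and 0xC1 here because the result is < 0x80.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError(f"unsupported byte 0x{b:02x} at offset {i}")
    return "".join(out)


def byte_filter_allows(data: bytes) -> bool:
    """Naive check done on the raw bytes, before any decoding."""
    return b"/" not in data


# 0xC0 0xAF is an overlong (illegal) spelling of '/' (U+002F).
PAYLOAD = b"..\xc0\xaf..\xc0\xafetc"

if __name__ == "__main__":
    print(byte_filter_allows(PAYLOAD))  # the filter sees no '/' byte
    print(lax_decode(PAYLOAD))          # the lax decoder produces slashes
```

The filter passes the payload because no 0x2F byte is present, yet the
lax decoder turns it into a path containing slashes.  A decoder that
rejects the overlong form never lets the filter and the application
disagree about what the string says.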

> > The alternate spelling
> >     11000001 10001011
> > ... is not the character K <U+004B> but INVALID SEQUENCE.  One
> > possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
> > CHARACTER on encountering illegal sequences.
> 
> Unless you are Bill Gates and have the power to decree that your users
> *will* use your preferred decoder, this may be a mistake.  Remember that
> the users of a decoder see no advantage from this behavior, since they are
> canonicalizing anyway.
> 

Um... not so.  Bill Gates decreed that everyone should use Windows,
and Windows has security holes up the wazoo, *because* its encodings
are ambiguous, although not because of UTF-8.

The user of the decoder is the user that gets bitten by these security
holes, so I would *definitely* say the user of the decoder would see a
benefit from this behaviour!!
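
As a sketch of what the strict behaviour looks like in practice,
Python's built-in UTF-8 codec can stand in for the decoder under
discussion (it is only one example of a strict decoder; the helper
names decode_strict and decode_replacing are hypothetical).  It treats
the alternate spelling of K as an error, and can either raise or
substitute U+FFFD:

```python
# The "alternate spelling" of K from the quoted message: 11000001 10001011.
ALTERNATE_K = bytes([0b11000001, 0b10001011])


def decode_strict(data: bytes) -> str:
    """Reject illegal sequences outright (raises UnicodeDecodeError)."""
    return data.decode("utf-8", errors="strict")


def decode_replacing(data: bytes) -> str:
    """Emit U+FFFD in place of illegal sequences instead of raising."""
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        decode_strict(ALTERNATE_K)
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)

    replaced = decode_replacing(ALTERNATE_K)
    # Only replacement characters come out; no 'K' is ever produced.
    print([f"U+{ord(ch):04X}" for ch in replaced])
```

Either way the illegal sequence never turns into a K, so no alias for
K exists that an earlier check on the bytes could have missed.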

Implicit aliases are very dangerous.

         -hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/