Re: UTF-8(4) versus UTF-8(6) security issues

Keld Jørn Simonsen Sat, 17 Jan 2004 13:28:04 -0800

On Sat, Jan 17, 2004 at 02:27:43PM -0500, Edward H Trager wrote:
> 
> 
> 
> On Sat, 17 Jan 2004, Markus Kuhn wrote:
> 
> > [EMAIL PROTECTED] wrote on 2004-01-11 16:53 UTC:
> > >     > That is not necessarily good advice in security issues.
> > >
> > >     What harm can it be? It will not be characters that are relevant in any
> > >     syntactical analyses.
> > >
> > > Consider: parser 1 knows that a UTF-8 sequence can have
> > > at most 6 bytes, and sees an illegal 5-byte sequence.
> > >
> > > Parser 2 knows that a UTF-8 sequence can have at most
> > > 4 bytes, and sees an illegal 4-byte sequence followed by
> > > an ASCII symbol.
> > >
> > > Difference in interpretation of a byte sequence always has
> > > security implications.
> >
> > Your example does not work, because an ASCII byte must always
> > resynchronize the decoder and be recognized as an ASCII character,
> > completely independent of whether the decoder knows about the existence
> > of 6-byte UTF-8 sequences or treats all bytes in the range 0xf8..0xff as
> > illegal. Bytes in the range 0x00..0x7f cannot be part of a malformed
> > UTF-8 sequence.
> >
> > I have yet to see a scenario where the difference between 4-byte and
> > 6-byte UTF-8 decoder could lead to a plausible security risk and I don't
> > believe that one is easy to construct or likely to happen.
> >
> 
> Hi, Markus,
> 
> Then I assume you would advocate that UTF-8 encoders/decoders (for example
> for Linux) be written to handle sequences of up to 6 bytes, not just up to
> four, which is probably the case now?


Well, many UTF-8 encoders/decoders probably handle sequences of up to
6 bytes, as the 4-byte restriction was defined for Unicode UTF-8 only
recently. ISO 10646 UTF-8 allowed sequences of up to 6 bytes from its
introduction about 10 years ago.
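
To make the difference concrete, here is a rough sketch of a decoder
whose maximum sequence length is a parameter, so the same code can act
as a 4-byte or a 6-byte decoder. It is only an illustration, not taken
from any existing library: the names decode, lead_length, max_len and
REPLACEMENT are made up for this example, and it deliberately skips
checks a real decoder needs (overlong forms, surrogates, range limits).
It does show the resynchronization rule quoted above: a byte in
0x00..0x7f always ends a malformed sequence and is decoded as ASCII.

```python
REPLACEMENT = 0xFFFD  # hypothetical marker emitted for each malformed unit


def lead_length(byte):
    """Sequence length announced by a lead byte, or 0 if it is not one."""
    if byte < 0x80:
        return 1
    if 0xC0 <= byte <= 0xDF:
        return 2
    if 0xE0 <= byte <= 0xEF:
        return 3
    if 0xF0 <= byte <= 0xF7:
        return 4
    if 0xF8 <= byte <= 0xFB:
        return 5
    if 0xFC <= byte <= 0xFD:
        return 6
    return 0  # continuation byte (0x80..0xBF) or 0xFE/0xFF


def decode(data, max_len):
    """Decode bytes to a list of code points, allowing at most max_len bytes
    per sequence (4 for the restricted form, 6 for the original form)."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        n = lead_length(first)
        if n == 1:
            # ASCII is always accepted, whatever came before it.
            out.append(first)
            i += 1
            continue
        if n == 0 or n > max_len:
            # Stray continuation byte, or a lead byte this decoder rejects.
            out.append(REPLACEMENT)
            i += 1
            continue
        cp = first & (0x7F >> n)  # payload bits of the lead byte
        j = i + 1
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        # A short sequence is malformed; decoding resumes at the byte
        # that interrupted it, so that byte is never swallowed.
        out.append(cp if j == i + n else REPLACEMENT)
        i = j
    return out


sample = b"\xf8\x88\x80\x80\x80/"  # a 5-byte sequence followed by ASCII '/'
print([hex(c) for c in decode(sample, 6)])
print([hex(c) for c in decode(sample, 4)])
```

With max_len=6 the five bytes decode to one code point; with max_len=4
each of them is reported as malformed. In both cases the trailing '/'
comes out as 0x2f, so the two decoders disagree only about the
non-ASCII part.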

Best regards
Keld

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
