>>>Markus Kuhn said:
> 4) I also noted that tclUtf:Tcl_UtfToUniChar accepts overlong UTF-8
> sequences. This can be a security vulnerability and is forbidden in
> Unicode 3.1. Practical example: a secure UTF-8 decoder must NOT accept
> any of
>
> 0xc0 0x8A
> 0xe0 0x80 0x8A
> 0xf0 0x80 0x80 0x8A
> 0xf8 0x80 0x80 0x80 0x8A
> 0xfc 0x80 0x80 0x80 0x80 0x8A
>
> as a valid encoding for U+000a, otherwise this could be used by
> attackers to bypass ASCII-level integrity checks (e.g. string must be a
> single line because it contains no 0x0a) before the UTF-8 decoder.
Tcl does its best to accept anything, but produce only shortest-form
output. The one special case is embedded nulls (0x0000), where Tcl
produces 0xC0 0x80 in order to avoid possible null-termination problems
with non-UTF aware code. It probably wouldn't break anything to
disallow non-shortest form UTF-8 for all but this one case. If you
eliminate the 0xC0 0x80 case, you'll have to check to make sure
*everything* is length encoded.
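
To make the idea concrete, here is a rough sketch in C of a decoder that
rejects non-shortest forms but can still let 0xC0 0x80 through as U+0000.
This is not Tcl's code: the function name strict_utf8_decode and the
allow_c080 flag are made up for illustration. It only checks for shortest
form; it does not reject surrogates or values above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Tcl. Decodes one UTF-8 sequence from
 * s[0..len). Returns the number of bytes consumed, or 0 if the sequence
 * is malformed, truncated, or overlong. If allow_c080 is nonzero, the
 * two-byte form 0xC0 0x80 is accepted as U+0000. */
size_t strict_utf8_decode(const unsigned char *s, size_t len,
                          unsigned long *cp, int allow_c080)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const unsigned long min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long c;
    size_t n, i;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }

    if ((s[0] & 0xE0) == 0xC0)      { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else return 0;  /* stray continuation byte, or a 5/6-byte lead */

    if (len < n) return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min_cp[n]) {
        /* Overlong form: reject, except for the C0 80 special case. */
        if (!(allow_c080 && n == 2 && s[0] == 0xC0 && s[1] == 0x80))
            return 0;
    }
    *cp = c;
    return n;
}
```

With this, every sequence in Markus's list returns 0, a plain 0x0A byte
decodes as U+000A, and 0xC0 0x80 decodes as U+0000 only when allow_c080
is set.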
--Scott
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/