(i'm not on python-dev, so i dunno whether this will make it through...)

basically, this bug does not affect the vast majority (mac and windows users with UTF-16 "narrow" unicode Python builds) because the unpatched code allocates sufficient memory in this case. only the minority treating this as a serious vulnerability (linux users with UTF-32 "wide" unicode Python builds, possibly some other Unix-like operating systems too) are affected by the buffer overrun.
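if you're not sure which kind of build you have, sys.maxunicode will tell you: 0xFFFF on a narrow build, 0x10FFFF on a wide one. here's a minimal sketch; the helper name unicode_build is made up for illustration. note that Python 3.3 and later no longer have the narrow/wide split and always report 0x10FFFF, so this check only says something useful on older interpreters.

```python
import sys

def unicode_build(maxunicode=sys.maxunicode):
    """classify a build from its sys.maxunicode value.

    hypothetical helper, not part of the standard library.
    """
    if maxunicode == 0xFFFF:
        return "narrow"  # UTF-16 elements; astral chars stored as surrogate pairs
    return "wide"        # one element per code point

print(unicode_build())
```

only builds that report "wide" on those older interpreters are candidates for the overrun.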
as for secunia, they need to do their own homework ;)

i found this bug and wrote the patch that's been applied by the linux distros, so i thought i should clear up a couple of apparent misconceptions. please pardon me if i'm writing stuff you already know...

the bug concerns allocation in repr() for unicode objects. previously repr() always allocated 6 bytes in the output buffer per input unicode string element; this is enough for the six-byte "\uffff" notation, and on UTF-16 python builds it is also enough for the ten-byte "\U0010ffff" notation, since on those builds the input unicode string contains a surrogate pair (two consecutive elements) to represent each unicode character requiring this longer notation, meaning five bytes per element. however, on UTF-32 builds ten bytes per unicode string element are needed, and this is what the patch accomplishes.

the previous (incorrect) algorithm extended the buffer by 100 bytes in some cases when encountering such a character; however, this fixed-size heuristic extension fails when the string contains many subsequent characters in the six-byte "\uffff" form, as demonstrated by this test, which will fail in an unpatched non-debug wide python build:

    python2.4 -c 'assert(repr(u"\U00010000" * 39 + u"\uffff" * 4096)) == (repr(u"\U00010000" * 39 + u"\uffff" * 4096))'

yes, a sufficiently motivated person could probably discover enough about the memory layout of a process to use this for data or code injection, but the more usual (and sometimes accidental) consequence is a crash.

more background: python comes in two flavors, UTF-16 ("narrow") and UTF-32 ("wide"), depending on how the unicode chars are represented. this is generally configured to match the C library's wchar_t.

UTF-16: Windows (at least 32-bit builds), Mac OS X (at least 32-bit builds), probably others too -- this uses a 16-bit variable-length encoding for Unicode characters:

  1 16-bit word for U+0000 ... U+FFFF (identity mapped to 0x0000 ... 0xffff resp., a.k.a. the "UCS-2" range or Basic Multilingual Plane)

  2 16-bit words for U+00010000 ... U+0010FFFF (mapped as "surrogate pairs" to 0xd800; 0xdc00 ... 0xdbff; 0xdfff resp., corresponding to planes 1 through 16)

UTF-32/UCS-4: Linux, possibly others? -- this uses 1 32-bit word per unicode character:

  1 word for all codepoints allowed by Python, U+0000 ... U+0010FFFF (identity mapped to 0x00000000L ... 0x0010ffffL resp.)

> On 10/7/06, skip[at]pobox.com <skip[at]pobox.com> wrote:
> >
> >     Georg> [ Bug http://python.org/sf/1541585 ]
> >
> >     Georg> This seems to be handled like a security issue by linux
> >     Georg> distributors, it's also a news item on security related pages.
> >
> >     Georg> Should a security advisory be written and official patches be
> >     Georg> provided?
> >
> > I asked about this a few weeks ago.  I got no direct response.  Secunia sent
> > mail to webmaster and the SF project admins asking about how this could be
> > exploited.  (Isn't figuring that stuff out their job?)
>
> FWIW, I responded to the original mail from Secunia with what little I
> know about the problem.  Everyone on the original mail was copied.
> However, I got ~30 bounces for all the Source Forge addresses due to
> some issue between SF and Google mail.
>
> n
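
p.s. to make the allocation arithmetic from the repr() discussion concrete, here's a rough model of the byte counts for the string used in the test. this is only a sketch, not the CPython source: the function names are made up, it assumes every element gets escaped, and it ignores the few extra bytes for the quotes and the u prefix as well as the 100-byte extension.

```python
# rough model of the repr() byte counts; NOT the actual CPython code.
# all names here are hypothetical and exist only for illustration.

BMP_ESCAPE_LEN = 6      # length of the "\uffff" notation
ASTRAL_ESCAPE_LEN = 10  # length of the "\U0010ffff" notation

def bytes_needed(code_points):
    """output bytes needed to escape every code point."""
    return sum(ASTRAL_ESCAPE_LEN if cp > 0xFFFF else BMP_ESCAPE_LEN
               for cp in code_points)

def element_count(code_points, wide):
    """string elements used: wide builds store one per code point,
    narrow builds store a surrogate pair (two) above U+FFFF."""
    if wide:
        return len(code_points)
    return sum(2 if cp > 0xFFFF else 1 for cp in code_points)

# the string from the test: 39 astral chars, then 4096 BMP chars
code_points = [0x10000] * 39 + [0xFFFF] * 4096

needed = bytes_needed(code_points)
old_wide = 6 * element_count(code_points, wide=True)      # unpatched, wide build
old_narrow = 6 * element_count(code_points, wide=False)   # unpatched, narrow build
new_wide = 10 * element_count(code_points, wide=True)     # ten bytes per element

print("needed:", needed)
print("unpatched wide:", old_wide, "shortfall:", needed - old_wide)
print("unpatched narrow:", old_narrow)
print("ten per element, wide:", new_wide)
```

in this model the unpatched wide build comes up 156 bytes short, which is more than one 100-byte extension covers, while the narrow build's 6 bytes per element is already enough because each astral character occupies two elements.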