On Wed, Apr 04, 2007 at 11:56:35PM -0400, Daniel B. wrote:
> Rich Felker wrote:
> ....
> >
> > Null termination is not the security problem. Broken languages that
> > DON'T use null-termination are the security problem, particularly
> > mixing them with C.
>
> C is the language that handles one out of 256 possible byte values
> inconsistently (with respect to the other 255) (in C strings).
Having a standard designated byte that can be treated specially is very
useful in practice. If there weren't such a powerful force establishing
NUL as the one, we'd have all sorts of different conventions. Just look
how much that already happens anyway... the use of : as a separator in
PATH-type strings, the use of spaces to separate command line arguments,
the use of = to separate environment variable names from values, etc.

Having a character you know can't occur in text (not just by arbitrary
rules, but because it's actually impossible for it to be passed in a C
string) is nice because there's at least one character you know is
always safe to use for app-internal in-band signalling. Notice also how
GNU find/xargs use NUL to cleanly separate filenames, relying on the
fact that it could never occur embedded in a filename (a rough sketch of
such a consumer appears below).

You can ask what would have happened if C had used pascal-style strings.
I suspect we would have been forced to deal with ridiculously small
length limits, controversial ABI changes to correct for it, etc.
Certainly for many types of applications it's beneficial to use smarter
data structures for text internally (more complex even than just
pascal-style strings), but I think C made a very good choice in using
the simplest possible representation for communicating reasonable-size
strings between the application, the system, and all the various
libraries that have followed the convention.

> The other languages handle all 256 byte values consistently.

Which ones? Now I think you're being hypocritical. One moment you're
applauding treating text as a sequence of Unicode codepoints in a way
that's not binary-clean for files containing invalid sequences, and
then you're complaining about C strings not being binary-clean because
NUL is a terminator.

NUL is not text. Arguably other control characters aside from newline
(and perhaps tab) are not text either. If you want to talk about binary
data instead of text, then C isn't doing anything inconsistent. The
functions for dealing with binary data (memcpy/memmove/memcmp/etc.)
don't treat NUL specially, of course (see the second sketch below).
There are plenty of languages which can't handle control characters in
strings well at all, much less NUL. I suspect most of the ones that
handle NUL the way you'd like them to also clobber invalid sequences
due to using UTF-16 internally.

> Why isn't it C that is a bit broken (that has irregular limitation)?

Because C was there first and C is essentially the only standardized
language. When your applications run on top of a system built upon C
and POSIX, you have to play by the C and POSIX rules. Ignoring this
necessity is what got Firefox burned.

Rich

P.S. If you really want to debate what I said about C being the only
standardized language/the authority/whatever, let's take it off-list,
because we've gotten way off-topic from utf-8 handling already. I have
reasons for what I say, but I really don't want to burden this list
with more off-topic sub-thread spinoffs.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
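As a rough illustration of the find/xargs point above, the following is
a minimal sketch (purely illustrative, not code from find or xargs) of
a consumer of NUL-delimited filenames on stdin, i.e. the format that
GNU "find -print0" produces and "xargs -0" consumes:

/* Minimal sketch: read NUL-delimited filenames from stdin and print
   one per line.  NUL works as the record terminator precisely because
   it can never occur inside a filename or a C string. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *buf = NULL;
    size_t cap = 0, len = 0;
    int c;

    while ((c = getchar()) != EOF) {
        if (len + 1 > cap) {                /* grow the record buffer */
            size_t newcap = cap ? cap * 2 : 64;
            char *p = realloc(buf, newcap);
            if (!p) { free(buf); return 1; }
            buf = p;
            cap = newcap;
        }
        if (c == '\0') {                    /* end of one filename */
            buf[len] = '\0';                /* record is now a C string */
            printf("%s\n", buf);
            len = 0;
        } else {
            buf[len++] = (char)c;
        }
    }
    free(buf);
    return 0;
}

Because NUL is the one byte that can never appear in a pathname, no
quoting or escaping is needed here; every other byte, including spaces
and newlines, passes through untouched.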
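And a second minimal sketch of the str*/mem* distinction; the buffer
contents are made up purely for illustration. The two buffers are
identical up to an embedded NUL and differ only after it, so strcmp()
reports them equal while memcmp() does not:

/* Minimal sketch: the binary-data functions treat NUL as just another
   byte value, while the string functions treat it as a terminator. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[8] = { 'a', 'b', '\0', 'x', 'x', 'x', 'x', 'x' };
    char b[8] = { 'a', 'b', '\0', 'y', 'y', 'y', 'y', 'y' };

    printf("strcmp: %d\n", strcmp(a, b));           /* 0: both are "ab" */
    printf("memcmp: %d\n", memcmp(a, b, sizeof a)); /* nonzero: bytes differ */
    return 0;
}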
