On Wed, Apr 04, 2007 at 11:56:35PM -0400, Daniel B. wrote:
> Rich Felker wrote:
> ....
> >
> > Null termination is not the security problem. Broken languages that
> > DON'T use null-termination are the security problem, particularly
> > mixing them with C.
>
> C is the language that handles one out of 256 possible byte values
> inconsistently (with respect to the other 255) (in C strings).
Having a standard designated byte that can be treated specially is very
useful in practice. If there weren't such a powerful force establishing
NUL as the one, we'd have all sorts of different conventions. Just look
how much that already happens anyway... the use of : as a separator in
PATH-type strings, the use of spaces to separate command line arguments,
the use of = to separate environment variable names from values, etc.

Having a character you know can't occur in text (not just by arbitrary
rules, but because it's actually impossible for it to be passed in a C
string) is nice because there's at least one character you know is
always safe to use for app-internal in-band signalling. Notice also how
GNU find/xargs use NUL to cleanly separate filenames, relying on the
fact that it could never occur embedded in a filename (a rough sketch of
such a consumer appears below).

You can ask what would have happened if C had used pascal-style strings.
I suspect we would have been forced to deal with ridiculously small
length limits, controversial ABI changes to correct for it, etc.
Certainly for many types of applications it's beneficial to use smarter
data structures for text internally (more complex even than just
pascal-style strings), but I think C made a very good choice in using
the simplest possible representation for communicating reasonable-size
strings between the application, the system, and all the various
libraries that have followed the convention.

> The other languages handle all 256 byte values consistently.

Which ones? Now I think you're being hypocritical. One moment you're
applauding treating text as a sequence of Unicode codepoints in a way
that's not binary-clean for files containing invalid sequences, and
then you're complaining about C strings not being binary-clean because
NUL is a terminator.

NUL is not text. Arguably other control characters aside from newline
(and perhaps tab) are not text either. If you want to talk about binary
data instead of text, then C isn't doing anything inconsistent. The
functions for dealing with binary data (memcpy/memmove/memcmp/etc.)
don't treat NUL specially, of course (see the second sketch below).
There are plenty of languages which can't handle control characters in
strings well at all, much less NUL. I suspect most of the ones that
handle NUL the way you'd like them to also clobber invalid sequences
due to using UTF-16 internally.

> Why isn't it C that is a bit broken (that has irregular limitation)?

Because C was there first and C is essentially the only standardized
language. When your applications run on top of a system built upon C
and POSIX, you have to play by the C and POSIX rules. Ignoring this
necessity is what got Firefox burned.

Rich

P.S. If you really want to debate what I said about C being the only
standardized language/the authority/whatever, let's take it off-list,
because we've gotten way off-topic from utf-8 handling already. I have
reasons for what I say, but I really don't want to burden this list
with more off-topic sub-thread spinoffs.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
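As a rough illustration of the find/xargs point above, the following is
a minimal sketch (purely illustrative, not code from find or xargs) of
a consumer of NUL-delimited filenames on stdin, i.e. the format that
GNU "find -print0" produces and "xargs -0" consumes:

/* Minimal sketch: read NUL-delimited filenames from stdin and print
   one per line.  NUL works as the record terminator precisely because
   it can never occur inside a filename or a C string. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *buf = NULL;
    size_t cap = 0, len = 0;
    int c;

    while ((c = getchar()) != EOF) {
        if (len + 1 > cap) {                /* grow the record buffer */
            size_t newcap = cap ? cap * 2 : 64;
            char *p = realloc(buf, newcap);
            if (!p) { free(buf); return 1; }
            buf = p;
            cap = newcap;
        }
        if (c == '\0') {                    /* end of one filename */
            buf[len] = '\0';                /* record is now a C string */
            printf("%s\n", buf);
            len = 0;
        } else {
            buf[len++] = (char)c;
        }
    }
    free(buf);
    return 0;
}

Because NUL is the one byte that can never appear in a pathname, no
quoting or escaping is needed here; every other byte, including spaces
and newlines, passes through untouched.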
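And a second minimal sketch of the str*/mem* distinction; the buffer
contents are made up purely for illustration. The two buffers are
identical up to an embedded NUL and differ only after it, so strcmp()
reports them equal while memcmp() does not:

/* Minimal sketch: the binary-data functions treat NUL as just another
   byte value, while the string functions treat it as a terminator. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char a[8] = { 'a', 'b', '\0', 'x', 'x', 'x', 'x', 'x' };
    char b[8] = { 'a', 'b', '\0', 'y', 'y', 'y', 'y', 'y' };

    printf("strcmp: %d\n", strcmp(a, b));           /* 0: both are "ab" */
    printf("memcmp: %d\n", memcmp(a, b, sizeof a)); /* nonzero: bytes differ */
    return 0;
}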
