On Thu, Apr 05, 2007 at 12:54:54PM +0200, Marcin 'Qrczak' Kowalczyk wrote:
> On Thu, 05-04-2007 at 02:04 -0400, Rich Felker wrote:
> 
> > Just look how much that already happens anyway... the use
> > of : as a separator in PATH-type strings, the use of spaces to
> > separate command line arguments, the use of = to separate environment
> > variable names from values, etc..
> 
> Do you propose to replace them with NULs? This would make no sense.

Of course not.

> A single environment variable can contain a whole PATH-type string.
> You can’t use NUL to delimit the whole string *and* its components
> at the same time. Different contexts require different delimiters
> if a string from one context is to be able to contain a sequence of
> another one.

My point is that the first level of in-band signalling (NUL as the
string terminator) is already standardized, which leaves one less
level for each application to define on its own.
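
A minimal sketch of the nesting being discussed, assuming an environment
entry of the form NAME=dir1:dir2:... held in an ordinary C string. NUL
ends the entry, '=' separates name from value, and ':' separates the
components of the value. The function name count_value_components is
hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given one environment entry "NAME=a:b:c",
 * count the ':'-separated components of the value part.
 * Returns 0 if the entry contains no '='. Empty components count too,
 * so "NAME=" yields 1. */
size_t count_value_components(const char *entry)
{
    assert(entry != NULL);

    /* Second level: '=' separates the name from the value. */
    const char *value = strchr(entry, '=');
    if (value == NULL)
        return 0;
    value++;

    /* First level: the loop stops at NUL, which ends the whole entry.
     * Third level: ':' separates components inside the value. */
    size_t n = 1;
    for (const char *p = value; *p != '\0'; p++) {
        if (*p == ':')
            n++;
    }
    return n;
}
```

Called on "PATH=/usr/bin:/bin" this returns 2. Each level needs its own
delimiter, but the outermost one never has to be chosen by the program.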

> > Having a character you know can't
> > occur in text (not just by arbitrary rules, but because it's actually
> > impossible for it to be passed in a C string) is nice because there's
> > at least one character you know is always safe to use for app-internal
> > in-band signalling.
> 
> Here you contradict yourself:

No. Inflammatory accusations like this are rather hasty and
inappropriate...

> > Notice also how GNU find/xargs use NUL to cleanly
> > separate filenames, relying on the fact that it could never occur
> > embedded in a filename.
> 
> because you show an example where NUL *is* used in text, and it’s used
> not internally but in communication between two programs.

That's not text. It's binary data containing a sequence of text
strings. The assumption that pipes==text is one of the most common
incorrect perceptions about unix, caused most likely by bad experience
with DOS pipes.
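
A minimal sketch of reading such a stream, assuming the whole input (for
example the output of find -print0) has already been read into a buffer
of known length. The helper names count_records and nth_record are
hypothetical. The buffer as a whole is binary data, while each record
inside it is an ordinary C string that may contain spaces or newlines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the NUL-terminated records in buf[0..len).
 * A trailing fragment with no terminating NUL is not counted. */
size_t count_records(const char *buf, size_t len)
{
    assert(buf != NULL);
    const char *p = buf;
    const char *end = buf + len;
    size_t n = 0;

    while (p < end) {
        const char *nul = memchr(p, '\0', (size_t)(end - p));
        if (nul == NULL)
            break;              /* incomplete last record */
        n++;
        p = nul + 1;            /* next record starts after the NUL */
    }
    return n;
}

/* Hypothetical helper: return a pointer to record number i (0-based),
 * or NULL if there are fewer than i + 1 complete records. The returned
 * pointer is usable as a C string without any copying. */
const char *nth_record(const char *buf, size_t len, size_t i)
{
    assert(buf != NULL);
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        const char *nul = memchr(p, '\0', (size_t)(end - p));
        if (nul == NULL)
            return NULL;
        if (i == 0)
            return p;
        i--;
        p = nul + 1;
    }
    return NULL;
}
```

A filename containing a newline survives intact here, which is not the
case when records are separated by newlines.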

> > > The other languages handle all 256 byte values consistently.
> > 
> > Which ones?
> 
> All languages besides C, except toy interpreters written in C by some
> students.

False.

> > There are plenty of languages which can't handle control characters in
> > strings well at all, much less NUL.
> 
> I don’t know any such language.

sed, awk, Bourne shell, ...

> > Because C was there first and C is essentially the only standardized
> > language.
> 
> Nonsense.

Like I said, if you want to debate this, email me off-list. It's quite
true, but mostly unrelated to the practical issues being discussed
here.

> > When your applications run on top of a system built upon C
> > and POSIX you have to play by the C and POSIX rules.
> 
> Only during communication with the system.
> 
> The only influence of C on string representation in other languages
> is that it’s common to redundantly have NUL stored after the string
> *in addition* to storing the length explicitly, so in cases the string
> doesn’t contain NUL itself it’s possible to pass the string to a C
> function without copying its contents.

This is bad design that leads to the sort of bugs seen in Firefox. If
we were living back in the 8-bit codepage days, it might make sense for
these languages to try to unify byte arrays and character strings, but
we're not. There's no practical reason a character string needs to
store the NUL character (it's already not binary-clean due to UTF-8)
and thus no reason to introduce this blatant incompatibility (which
almost always turns into bugs and vulnerabilities) with the underlying
system.
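
A minimal sketch of the mismatch, assuming a length+data representation
with an extra NUL stored after the data as described in the quoted text.
The struct counted_str and the function same_in_both_views are
hypothetical and not taken from any particular language runtime. When
the data contains an embedded NUL, the runtime sees len bytes while any
C function sees only the bytes before the first NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length+data string. data[len] is assumed to hold an
 * extra terminating NUL so the pointer can be handed to C functions. */
struct counted_str {
    size_t      len;
    const char *data;
};

/* Returns 1 if the counted view (len bytes) and the C view (bytes up to
 * the first NUL) describe the same string, 0 if an embedded NUL makes
 * the two views disagree. */
int same_in_both_views(const struct counted_str *s)
{
    assert(s != NULL && s->data != NULL);
    return memchr(s->data, '\0', s->len) == NULL;
}
```

For a value holding the 11 bytes of "good\0evil.x", strlen() on the data
pointer reports 4, so a check made on the full 11 bytes and an operation
performed through a C interface act on different strings.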

Also note that there's nothing "backwards" about using termination
instead of length+data. For example it's the natural way a string
would be represented in a pure (without a special string type) Lisp-like
language. (Of course using a list is still binary clean because the
terminator is in the cdr rather than the car.) And like with lists, C
strings have the advantage that a terminal substring of the original
string is already a string in-place, without copying.
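
A minimal sketch of the terminal-substring property, assuming a
'/'-separated path. The function name last_component is hypothetical.
The result is a pointer into the original string, and because the
terminator is shared it is a complete C string with no allocation or
copying.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return the part of path after the last '/',
 * or path itself if it contains no '/'. The result points into the
 * caller's string and shares its terminating NUL. */
const char *last_component(const char *path)
{
    assert(path != NULL);
    const char *slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}
```

With a length+data representation the same operation needs either a
new header or a separate slice type that refers back to the original.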

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/