On Thu, Apr 05, 2007 at 12:54:54PM +0200, Marcin 'Qrczak' Kowalczyk wrote:
> On Thu, 05-04-2007 at 02:04 -0400, Rich Felker wrote:
>
> > Just look how much that already happens anyway... the use
> > of : as a separator in PATH-type strings, the use of spaces to
> > separate command line arguments, the use of = to separate environment
> > variable names from values, etc..
>
> Do you propose to replace them with NULs? This would make no sense.

Of course not.

> A single environment variable can contain a whole PATH-type string.
> You can’t use NUL to delimit the whole string *and* its components
> at the same time. Different contexts require different delimiters
> if a string from one context is to be able to contain a sequence of
> another one.

My point is that the first level of in-band signalling is already
standardized, making for one less.

> > Having a character you know can't
> > occur in text (not just by arbitrary rules, but because it's actually
> > impossible for it to be passed in a C string) is nice because there's
> > at least one character you know is always safe to use for app-internal
> > in-band signalling.
>
> Here you contradict yourself:

No. Inflammatory accusations like this are rather hasty and
inappropriate...

> > Notice also how GNU find/xargs use NUL to cleanly
> > separate filenames, relying on the fact that it could never occur
> > embedded in a filename.
>
> because you show an example where NUL *is* used in text, and it’s used
> not internally but in communication between two programs.

That's not text. It's binary data containing a sequence of text
strings. The assumption that pipes==text is one of the most common
incorrect perceptions about Unix, caused most likely by bad experience
with DOS pipes.

> > > The other languages handle all 256 byte values consistently.
> >
> > Which ones?
>
> All languages besides C, except toy interpreters written in C by some
> students.

False.

> > There are plenty of languages which can't handle control characters in
> > strings well at all, much less NUL.
>
> I don’t know any such language.

sed, awk, Bourne shell, ....

> > Because C was there first and C is essentially the only standardized
> > language.
>
> Nonsense.

Like I said, if you want to debate this, email me off-list. It's quite
true, but mostly unrelated to the practical issues being discussed
here.
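
To make the find/xargs point above concrete, here is a minimal sketch
of a NUL-delimited stream of filenames, written in Python purely for
illustration. The helper names pack_names and unpack_names are
hypothetical and are not part of find, xargs, or any library; the
sketch assumes each name is a byte string and that every record,
including the last, is terminated by a NUL byte.

```python
def pack_names(names):
    """Join byte-string names into one stream, each record ending in NUL."""
    for name in names:
        # A name that already held NUL would make the stream ambiguous.
        if b"\0" in name:
            raise ValueError("a filename cannot contain NUL")
    return b"".join(name + b"\0" for name in names)


def unpack_names(stream):
    """Split a NUL-terminated record stream back into the original names."""
    if not stream:
        return []
    if not stream.endswith(b"\0"):
        raise ValueError("stream is truncated: last record lacks its NUL")
    # Drop the final terminator, then split on the remaining ones.
    return stream[:-1].split(b"\0")


# Names with a space and a newline, both legal in Unix filenames.
names = [b"plain.txt", b"with space.txt", b"with\nnewline.txt"]
stream = pack_names(names)

# The round trip through the NUL-delimited stream is lossless.
print(unpack_names(stream) == names)

# Splitting the same bytes on newline cuts the third name in two.
print(stream.split(b"\n"))
```

The stream as a whole is binary data, while each record inside it is
an ordinary string that can be handed to a C function unchanged. A
newline- or space-delimited stream cannot offer the same guarantee,
because both characters may legitimately appear inside a filename.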

> > When your applications run on top of a system built upon C
> > and POSIX you have to play by the C and POSIX rules.
>
> Only during communication with the system.
>
> The only influence of C on string representation in other languages
> is that it’s common to redundantly have NUL stored after the string
> *in addition* to storing the length explicitly, so in cases the string
> doesn’t contain NUL itself it’s possible to pass the string to a C
> function without copying its contents.

This is bad design that leads to the sort of bugs seen in Firefox. If
we were living back in the 8-bit codepage days, it might make sense
for these languages to try to unify byte arrays and character strings,
but we're not. There's no practical reason a character string needs to
store the NUL character (it's already not binary-clean due to UTF-8)
and thus no reason to introduce this blatant incompatibility (which
almost always turns into bugs and vulnerabilities) with the underlying
system.

Also note that there's nothing "backwards" about using termination
instead of length+data. For example it's the natural way a string
would be represented in a pure (without special string type) lisp-like
language. (Of course using a list is still binary clean because the
terminator is in the cdr rather than the car.) And like with lists, C
strings have the advantage that a terminal substring of the original
string is already a string in-place, without copying.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
