On Thu, 2007-04-05 at 11:32 -0400, Rich Felker wrote:

> My point is that the first level of in-band signalling is already
> standardized, making for one less.
The issue was whether NUL characters should be excluded from the string
type in a programming language, and whether the internal representation
of strings should rely on NUL as the terminator (as opposed to storing
the length separately). I claim that it would be a bad idea. The
interface for working with strings can be just as convenient when NUL is
not excluded, and there are cases where programs deal with strings
containing NULs, so excluding them is harmful. Even if NUL is special in
some OS API, that's not a reason to make it special in core string
handling.

> > > There are plenty of languages which can't handle control
> > > characters in strings well at all, much less NUL.
> >
> > I don’t know any such language.
>
> sed, awk, bourne shell, ....

True for mawk, bash, and ksh, but not for GNU sed, gawk, and zsh, which
are capable of storing NULs in user strings, and these NULs are
correctly passed to and processed by internal shell commands. Anyway,
these are exceptions rather than the rule.

> > The only influence of C on string representation in other languages
> > is that it’s common to redundantly have NUL stored after the string
> > *in addition* to storing the length explicitly, so in cases the
> > string doesn’t contain NUL itself it’s possible to pass the string
> > to a C function without copying its contents.
>
> This is bad design that leads to the sort of bugs seen in Firefox. If
> we were living back in the 8bit codepage days, it might make sense for
> these languages to try to unify byte arrays and character strings, but
> we're not.

This is another issue. I’m for distinguishing character strings from
byte strings. I’m against making U+0000 or 0 a special case in either of
them. For example, in my language Kogut a string is a sequence of
Unicode code points. My implementation uses two string representations
internally: if a string contains no characters above U+00FF, it’s stored
as a sequence of bytes; otherwise it’s a sequence of 32-bit integers.
This variation is not visible in the language. The narrow case has a
redundant NUL appended. When a string is passed to some C function and
the function expects the default encoding (normally taken from the
locale), then, under the assumption that the default encoding is
ASCII-compatible, if the string contains only ASCII characters excluding
NUL, a pointer to the string data is passed; otherwise a recoded array
of bytes is created. This is quite a practical reason to store the
redundant NULs, even though NUL is not special as far as the string type
is concerned. Most strings manipulated by average programs are
ASCII-only.

> Also note that there's nothing "backwards" about using termination
> instead of length+data. For example it's the natural way a string
> would be represented in a pure (without special string type) lisp-like
> language. (Of course using a list is still binary clean because the
> terminator is in the cdr rather than the car.)

The parenthesized remark is crucial. Lisp lists use an out-of-band
terminator, not an in-band one.

> And like with lists, C strings have the advantage that a terminal
> substring of the original string is already a string in-place, without
> copying.

This is too small an advantage to outweigh the inability to store NULs
and the lack of an O(1) length check (which rules out bounds checking on
indexing), and it’s impractical with garbage collection anyway.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/