On Thu, 05-04-2007 at 11:32 -0400, Rich Felker wrote:

> My point is that the first level of in-band signalling is already
> standardized, making for one less.

The issue was whether NUL characters should be excluded from the string
type in a programming language, and whether the internal representation
of strings should rely on NUL as the terminator (as opposed to storing
the length separately).

I claim that this would be a bad idea. A string-handling interface can
be just as convenient when NUL is not excluded, and there are cases
where programs deal with strings containing NULs, so excluding them is
harmful. Even if NUL is special in some OS API, that is no reason to
make it special in core string handling.
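
To make this concrete, here is a minimal C sketch of a counted string
whose length lives out of band, so a NUL byte is ordinary data (the
names str and str_from are mine, not from any real library, and error
handling is omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Counted string: the length is stored out of band, so a NUL
     * byte is ordinary data. */
    struct str {
        size_t len;
        char *data;
    };

    static struct str str_from(const char *bytes, size_t len) {
        struct str s;
        s.len = len;
        s.data = malloc(len);
        memcpy(s.data, bytes, len);  /* embedded NULs copied like any byte */
        return s;
    }

    int main(void) {
        struct str s = str_from("foo\0bar", 7);          /* NUL in the middle */
        printf("counted length: %zu\n", s.len);          /* 7 */
        printf("strlen says:    %zu\n", strlen(s.data)); /* 3: stops at NUL */
        free(s.data);
        return 0;
    }

Note how the counted length reports 7, while strlen, which trusts the
in-band terminator, stops at the embedded NUL and reports 3.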

> > > There are plenty of languages which can't handle control characters in
> > > strings well at all, much less NUL.
> > 
> > I don’t know any such language.
> 
> sed, awk, bourne shell, ....

True for mawk, bash, and ksh, but not for GNU sed, gawk, and zsh,
which are capable of storing NULs in user strings; in zsh such NULs
are also correctly passed to and processed by builtin commands.

Anyway, these are exceptions rather than the rule.

> > The only influence of C on string representation in other languages
> > is that it’s common to redundantly have NUL stored after the string
> > *in addition* to storing the length explicitly, so in cases where
> > the string doesn’t contain NUL itself it’s possible to pass the
> > string to a C function without copying its contents.
> 
> This is bad design that leads to the sort of bugs seen in Firefox. If
> we were living back in the 8bit codepage days, it might make sense for
> these languages to try to unify byte arrays and character strings, but
> we're not.

This is another issue. I’m for distinguishing character strings from
byte strings. I’m against making U+0000 or 0 a special case in either
of them.

For example, in my language Kogut a string is a sequence of Unicode
code points. My implementation uses two string representations
internally: if a string contains no characters above U+00FF, it is
stored as a sequence of bytes; otherwise it is a sequence of 32-bit
integers. This variation is not visible in the language. The narrow
case has a redundant NUL appended. When a string is passed to some
C function and the function expects the default encoding (normally
taken from the locale), then, under the assumption that the default
encoding is ASCII-compatible, a pointer to the string data is passed
directly if the string contains only ASCII characters excluding NUL;
otherwise a recoded array of bytes is created. This is quite a
practical reason to store the redundant NULs, even though NUL is not
special as far as the string type is concerned: most strings
manipulated by average programs are ASCII-only.
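
A rough sketch of the shape of this scheme in C (illustrative only;
this is not Kogut’s actual source, and all names are mine):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* Two internal representations behind one string type: narrow
     * strings (all code points <= U+00FF) as bytes, wide strings as
     * 32-bit code points.  The tag is never visible to the language. */
    enum repr { NARROW, WIDE };

    struct string {
        enum repr tag;
        size_t len;
        union {
            unsigned char *narrow;  /* len bytes, plus a redundant NUL */
            uint32_t *wide;         /* len code points, no terminator */
        } u;
    };

    /* May the buffer be handed to a C function as-is?  Only if the
     * string is narrow, all-ASCII, and free of NULs; otherwise a
     * recoded byte array must be created. */
    static bool passable_as_c_string(const struct string *s) {
        if (s->tag != NARROW)
            return false;
        for (size_t i = 0; i < s->len; i++)
            if (s->u.narrow[i] == 0 || s->u.narrow[i] > 0x7F)
                return false;
        return true;   /* s->u.narrow[s->len] is already '\0' */
    }

The check is cheap (one scan of the narrow buffer), and the redundant
trailing NUL means no copy is needed in the common all-ASCII case.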

> Also note that there's nothing "backwards" about using termination
> instead of length+data. For example it's the natural way a string
> would be represented in a pure (without special string type) lisp-like
> language. (Of course using a list is still binary clean because the
> terminator is in the cdr rather than the car.)

The parenthesized remark is crucial. Lisp lists use an out-of-band
terminator, not an in-band one.
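
In C terms, a cons cell might look like this (a sketch, not any
particular Lisp’s actual layout); the end-of-list marker lives in the
link field, so the element field can hold any value, including 0:

    #include <stdlib.h>

    /* A cons cell: the end of the list is marked by a NULL cdr,
     * out of band.  The car can hold any value, including 0. */
    struct cons {
        unsigned int car;    /* a code point; 0 is a legal element */
        struct cons *cdr;    /* NULL terminates the list */
    };

    static struct cons *cons(unsigned int car, struct cons *cdr) {
        struct cons *c = malloc(sizeof *c);
        c->car = car;
        c->cdr = cdr;
        return c;
    }

    /* The string (0, 'A', 0) is representable without ambiguity:
     *   struct cons *l = cons(0, cons('A', cons(0, NULL)));
     */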

> And like with lists, C
> strings have the advantage that a terminal substring of the original
> string is already a string in-place, without copying.

This is too small an advantage to outweigh the inability to store NULs
and the lack of an O(1) length check (which rules out bounds checking
on indexing), and it’s impractical with garbage collection anyway.
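
Both points fit in a few lines of C (a sketch; the slice type is mine):
a suffix of a NUL-terminated string is free, but its length costs a
scan, while a counted (pointer, length) slice shares the same bytes
and keeps the length O(1):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *s = "hello world";

        const char *suffix = s + 6;  /* "world": a C string in place, no copy */
        size_t n = strlen(s);        /* but the length costs a scan of 11 bytes */

        /* A counted representation shares the same bytes through a
         * (pointer, length) pair, and keeps the length check O(1): */
        struct slice { const char *p; size_t len; };
        struct slice tail = { s + 6, n - 6 };

        printf("%s / %.*s\n", suffix, (int)tail.len, tail.p);
        return 0;
    }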

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/


