On Sat, Apr 07, 2007 at 08:21:25PM +0200, Marcin 'Qrczak' Kowalczyk wrote:
> > Using UTF-8 would have accomplished the same thing without
> > special-casing.
> 
> Then iterating over strings and specifying string fragments could not be
> done by code point indices, and it’s not obvious what a good interface
> should look like.

One idea is to have a 'point' in a string be an abstract data type
rather than just an integer index. In reality it would just be a UTF-8
byte offset.
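
Roughly something like this (a minimal sketch in C, assuming a
NUL-terminated, well-formed UTF-8 buffer; the names str_point and
point_next are made up):

#include <stddef.h>

/* Opaque to callers; internally just a byte offset into the UTF-8 data. */
typedef struct {
    size_t byte_off;
} str_point;

/* Advance p past one code point.  Returns 0 at the end of the string. */
static int point_next(const char *s, str_point *p)
{
    unsigned char c = (unsigned char)s[p->byte_off];
    if (c == '\0')
        return 0;
    if (c < 0x80)                p->byte_off += 1;   /* ASCII           */
    else if ((c & 0xE0) == 0xC0) p->byte_off += 2;   /* 2-byte sequence */
    else if ((c & 0xF0) == 0xE0) p->byte_off += 3;   /* 3-byte sequence */
    else                         p->byte_off += 4;   /* 4-byte sequence */
    return 1;
}

Callers never see the offset, so the representation stays free to change,
and comparing two points for order is still just comparing offsets.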

> Operations like splitting on whitespace would no
> longer have simple implementations based on examining successive code
> points.

Sure they would. Accessing the character at a point would still evaluate
to a character. Instead of an if/else for the 8-bit vs. 32-bit string
representations, you'd just have a single UTF-8 operation.
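
Concretely (again only a sketch, reusing str_point from above and
assuming well-formed UTF-8; point_char is a made-up name):

#include <stdint.h>

/* Decode and return the code point at p. */
static uint32_t point_char(const char *s, str_point p)
{
    const unsigned char *b = (const unsigned char *)s + p.byte_off;
    if (b[0] < 0x80)
        return b[0];
    if ((b[0] & 0xE0) == 0xC0)
        return ((uint32_t)(b[0] & 0x1F) << 6) | (b[1] & 0x3F);
    if ((b[0] & 0xF0) == 0xE0)
        return ((uint32_t)(b[0] & 0x0F) << 12) |
               ((uint32_t)(b[1] & 0x3F) << 6)  | (b[2] & 0x3F);
    return ((uint32_t)(b[0] & 0x07) << 18) |
           ((uint32_t)(b[1] & 0x3F) << 12) |
           ((uint32_t)(b[2] & 0x3F) << 6)  | (b[3] & 0x3F);
}

Splitting on whitespace is then just a loop over point_char() and
point_next(), examining successive code points with no branching on the
string representation.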

> This still rules out bounds checking. If each s[i] among 4096 indexing
> operations has the cost of 4096-i, then 8M might become noticeable.

Indeed this is true. Here's a place where you're very right: an HLL
which does bounds checking will want to know (at the implementation
level) the size of its arrays. On the other hand, this information is
useless to C, and if you're writing C, it's your responsibility to
know whether the offset you access is safe before you access it.
Different languages and target audiences/domains have different
requirements.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
