On Tue, Mar 27, 2007 at 10:07:11PM -0400, Daniel B. wrote:
> ???????? wrote:
> > 
> > > That would be contradictory to the whole concept of Unicode. A
> > > human-readable string should never be considered an array of bytes, it
> > > is an array of characters!
> > 
> > Hrm, that statement I think I would object to. For the vast
> > majority of programs, strings are simply arrays of bytes
> > (regardless of encoding). The only time source code needs to care about
> > characters is when it has to lay out or format them for display.
> 
> What about when it breaks a string into substrings at some delimiter,
> say, using a regular expression?  It has to break the underlying byte 
> string at a character boundary.

Searching for the delimiter already gives you a character boundary.
There is no need to think further about it.

For example, the Unix "cut" program works automatically with UTF-8
text as long as the delimiter is a single byte. If you want multibyte
delimiters, all you need to do is make it accept a multibyte delimiter
character and then do a substring search instead of a byte search.
There is no need to ever treat the input string as characters, and in
fact doing so just makes the program slow and bloated.
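
To make that concrete, here is a minimal sketch of a cut-like field
extractor that never decodes its input. It is in Python purely for
illustration; the function name cut_field and the sample strings are
made up for this example and are not taken from any real cut
implementation.

```python
def cut_field(line: bytes, delim: bytes, field: int) -> bytes:
    # bytes.split does a plain substring search for delim; nothing is decoded.
    parts = line.split(delim)
    # Fields are 1-based, as with "cut -f"; out-of-range fields yield b"".
    return parts[field - 1] if 1 <= field <= len(parts) else b""


# Single-byte delimiter, multibyte field contents.
row = "naïve:café:日本語".encode("utf-8")
second = cut_field(row, b":", 2)

# Multibyte delimiter: still just a byte-string search.
arrow = "→".encode("utf-8")
row2 = "α→β→γ".encode("utf-8")
middle = cut_field(row2, arrow, 2)
```

The only difference between the two calls is the length of the
delimiter byte string; the fields come back as valid UTF-8 either way.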

> In fact, what about interpreting an underlying string of bytes as
> the right individual characters in that regular expression?  
> 
> Any time a program uses the underlying byte string as a character
> string other than simply a whole string (e.g., breaking it apart, 
> interpreting it), it needs to consider it at the character level,
> not the byte level.

You're mistaken. Most of the time you can avoid thinking about
characters entirely. Not always, but much more often than you think.
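
As a hedged illustration of the regular expression case, the sketch
below splits UTF-8 data with a byte-level pattern, using Python's re
module on bytes. The sample data is invented. It works because the
pattern only mentions ASCII bytes, which never occur inside a multibyte
UTF-8 sequence. A pattern that puts a non-ASCII character inside a
bracket expression would be matched byte by byte, and that is one of
the "not always" cases.

```python
import re

# UTF-8 bytes, never decoded. The pattern is a bytes pattern, so \s
# matches only ASCII whitespace.
data = "Zoë , Łukasz,José\t,  北京".encode("utf-8")
fields = re.split(rb"\s*,\s*", data)

# A multibyte literal is just a longer byte sequence to search for.
arrow = "→".encode("utf-8")
steps = re.split(re.escape(arrow), "α→β→γ".encode("utf-8"))
```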

> > When I write a basic little perl script that reads in lines from a
> > file, does trivial string operations on them, then prints them back
> > out, there should be absolutely no need for my code to make any
> > special considerations for encoding.
> 
> It depends how trivial the operations are.
> 
> (Offhand, the only things I think would be safe are copying and
> appending.)

This is because you don't understand UTF-8.
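
The relevant property is that UTF-8 continuation bytes always have the
form 10xxxxxx, while ASCII bytes and lead bytes never do, so a complete
encoded character cannot match starting in the middle of another one.
The sketch below (Python for illustration; the helper names
is_char_boundary and find_all are hypothetical) checks that every
byte-level match of a multibyte needle starts and ends on a character
boundary, and that a byte-level replace therefore leaves valid UTF-8
behind.

```python
def is_char_boundary(buf: bytes, i: int) -> bool:
    # Continuation bytes look like 10xxxxxx; anything else starts a character.
    return i == len(buf) or (buf[i] & 0xC0) != 0x80


def find_all(haystack: bytes, needle: bytes):
    # Yield the byte offset of every occurrence of needle in haystack.
    i = haystack.find(needle)
    while i != -1:
        yield i
        i = haystack.find(needle, i + 1)


text = "日本語→テキスト→é".encode("utf-8")
needle = "→".encode("utf-8")

hits = list(find_all(text, needle))
all_on_boundaries = all(
    is_char_boundary(text, i) and is_char_boundary(text, i + len(needle))
    for i in hits
)

# Byte-level replacement, no decoding involved.
replaced = text.replace(needle, b",")
```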

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
