???????? wrote:
> 
> > That would be contrary to the whole concept of Unicode. A
> > human-readable string should never be considered an array of bytes;
> > it is an array of characters!
> 
> Hrm, I think I would object to that statement. For the overwhelming
> majority of programs, strings are simply arrays of bytes, regardless
> of encoding. The only time source code needs to care about characters
> is when it has to lay them out or format them for display.

What about when a program breaks a string into substrings at some
delimiter, say, using a regular expression?  It has to break the
underlying byte string at a character boundary.

In fact, what about interpreting the underlying string of bytes as the
right individual characters when matching that regular expression?

Any time a program uses the underlying byte string as a character
string in any way other than as an opaque whole (e.g., breaking it
apart or interpreting its contents), it needs to work at the character
level, not the byte level.
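
To make that concrete, here is a minimal Perl sketch (the string
literal is just an illustration, and I'm assuming the underlying bytes
are UTF-8).  Byte-level length() and substr() happily cut a multibyte
character in half, while the same operations on the decoded string
respect character boundaries:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # "naïve café" as raw UTF-8 octets, e.g. as read from a file
    my $bytes = "na\xc3\xafve caf\xc3\xa9";

    print length($bytes), "\n";          # 12 -- counts octets
    print substr($bytes, 0, 3), "\n";    # "na" plus a lone 0xC3 lead byte

    my $chars = decode('UTF-8', $bytes); # interpret the bytes as characters
    print length($chars), "\n";          # 10 -- counts characters
    print encode('UTF-8', substr($chars, 0, 3)), "\n";   # "naï"

The same goes for regular expressions: a pattern like /./ matches one
octet against the raw byte string but one character against the decoded
string, so the match positions (and hence any split points) differ.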



> When I write a basic little Perl script that reads in lines from a
> file, does trivial string operations on them, then prints them back
> out, there should be absolutely no need for my code to take any
> special account of the encoding.

It depends on how trivial the operations are.

(Offhand, the only things I think would be safe are copying and
appending.)
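
For example (again a minimal sketch, assuming UTF-8 input): reversing a
line looks trivial, but at the byte level it scrambles every multibyte
sequence, whereas on the decoded characters it stays well formed:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    my $bytes = "caf\xc3\xa9";            # "café" as UTF-8 octets

    # Byte reversal produces "\xa9\xc3fac" -- no longer valid UTF-8.
    print scalar reverse($bytes), "\n";

    # Character reversal: decode, reverse, re-encode -> "éfac".
    print encode('UTF-8', scalar reverse(decode('UTF-8', $bytes))), "\n";

Uppercasing, truncating to N "characters", and so on run into the same
problem unless the bytes are decoded first.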


Daniel
-- 
Daniel Barclay
[EMAIL PROTECTED]

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
