???????? wrote:
>
> > That would be contradictory to the whole concept of Unicode.  A
> > human-readable string should never be considered an array of bytes, it is an
> > array of characters!
>
> Hrm, that statement I think I would object to.  For the overwhelming
> vast majority of programs, strings are simply arrays of bytes.
> (regardless of encoding)  The only time source code needs to care about
> characters is when it has to layout or format them for display.
What about when it breaks a string into substrings at some delimiter, say,
using a regular expression?  It has to break the underlying byte string at a
character boundary.  In fact, what about interpreting the underlying string of
bytes as the right individual characters in that regular expression?

Any time a program uses the underlying byte string as a character string in
any way other than as a single opaque whole (e.g., breaking it apart,
interpreting it), it needs to consider it at the character level, not the
byte level.

> When I write a basic little perl script that reads in lines from a
> file, does trivial string operations on them, then prints them back
> out, there should be absolutely no need for my code to make any
> special considerations for encoding.

It depends how trivial the operations are.  (Offhand, the only things I
think would be safe are copying and appending.)

Daniel
--
Daniel Barclay
[EMAIL PROTECTED]

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
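
(Editorial illustration, not part of the original message: a minimal Perl
sketch of the point above.  It assumes perl 5.8 or later with the core Encode
module; the sample string "café" and the results noted in the comments are
made up for the example.)

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# "café" as the raw UTF-8 bytes a script sees when it reads a line
# without declaring an encoding (é is the two bytes 0xC3 0xA9).
my $bytes = "caf\xc3\xa9";

# Byte-level view: every "string" operation works on bytes.
print length($bytes), "\n";        # 5 -- counts bytes, not characters
my @pieces = split //, $bytes;     # 5 pieces; é is split across two of them
$bytes =~ /(.)\z/;                 # $1 is the lone byte 0xA9, half of é
my $rev = reverse $bytes;          # "\xA9\xC3fac" -- no longer valid UTF-8

# Character-level view: decode on input, operate on characters,
# encode again on output.
my $chars = decode('UTF-8', $bytes);
print length($chars), "\n";        # 4 -- counts characters
@pieces = split //, $chars;        # 4 pieces; é stays whole
$chars =~ /(.)\z/;                 # $1 is the whole character é
print encode('UTF-8', scalar reverse $chars), "\n";   # "éfac", still valid

The byte-level half happens to give the same answers for pure ASCII data,
which is why a "basic little perl script" can look encoding-agnostic right up
until the first multi-byte character appears in its input.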
