On Tue, Mar 27, 2007 at 10:07:11PM -0400, Daniel B. wrote:
> ???????? wrote:
> >
> > > That would be contradictory to the whole concept of Unicode. A
> > > human-readable string should never be considered an array of bytes, it is an
> > > array of characters!
> >
> > Hrm, that statement I think I would object to. For the overwhelming
> > vast majority of programs, strings are simply arrays of bytes.
> > (regardless of encoding) The only time source code needs to care about
> > characters is when it has to layout or format them for display.
>
> What about when it breaks a string into substrings at some delimiter,
> say, using a regular expression? It has to break the underlying byte
> string at a character boundary.

Searching for the delimiter already gives you a character boundary.
There is no need to think further about it. For example, the unix "cut"
program works automatically with UTF-8 text as long as the delimiter is
a single byte, and if you want multibyte delimiters, all you need to do
is make it accept a multibyte delimiter character and then do a
substring search instead of a byte search. There is no need to ever
treat the input string as characters, and in fact doing so just makes
it slow and bloated.

> In fact, what about interpreting an underlying string of bytes
> as the right individual characters in that regular expression?
>
> Any time a program uses the underlying byte string as a character
> string other than simply a whole string (e.g., breaking it apart,
> interpreting it), it needs to consider it at the character level,
> not the byte level.

You're mistaken. Most of the time you can avoid thinking about
characters entirely. Not always, but much more often than you think.

> > When I write a basic little perl script that reads in lines from a
> > file, does trivial string operations on them, then prints them back
> > out, there should be absolutely no need for my code to make any
> > special considerations for encoding.
>
> It depends how trivial the operations are.
>
> (Offhand, the only things I think would be safe are copying and
> appending.)

This is because you don't understand UTF-8.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
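
P.S. To make the delimiter point concrete, here is a minimal sketch in Python (chosen only for illustration; `cut_fields` is a hypothetical helper, not part of "cut" or any real tool). It assumes both the input and the delimiter are valid UTF-8. In UTF-8, lead bytes and continuation bytes occupy disjoint ranges, so a byte-level match of a complete encoded character cannot begin or end in the middle of another character.

```python
from typing import List


def cut_fields(line: bytes, delim: bytes) -> List[bytes]:
    """Split UTF-8 encoded bytes on a UTF-8 encoded delimiter.

    Nothing is decoded here: bytes.split is a plain substring search.
    Assumes `line` and `delim` are both valid UTF-8 (an assumption of
    this sketch, not something the function checks).
    """
    return line.split(delim)


# Single-byte delimiter, multibyte field contents.
row = "α:βγ:δ".encode("utf-8")
fields = cut_fields(row, b":")

# Multibyte delimiter: U+2192 is three bytes in UTF-8.
arrow = "→".encode("utf-8")
row2 = "привет→мир→日本語".encode("utf-8")
fields2 = cut_fields(row2, arrow)

# Decoding is done only to verify the result: every piece is still
# valid UTF-8, so each split fell on a character boundary.
decoded = [f.decode("utf-8") for f in fields]
decoded2 = [f.decode("utf-8") for f in fields2]
print(decoded)
print(decoded2)
```

The decode step at the end exists only to check the result. The splitting itself never looks at characters, which is the whole point.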
