Rich Felker wrote:
>
> On Tue, Mar 27, 2007 at 10:07:11PM -0400, Daniel B. wrote:
...
> >
> > What about when it breaks a string into substrings at some delimiter,
> > say, using a regular expression?  It has to break the underlying byte
> > string at a character boundary.
>
> Searching for the delimiter already gives you a character boundary.
> There is no need to think further about it.

As long as you specified the delimiter properly (a whole character, not
a partial byte sequence).

> For example, the unix "cut" program works automatically with UTF-8
> text as long as the delimiter is a single byte,

By "single byte," do you mean a character whose UTF-8 representation is
a single byte?

(If you gave it the byte 0xBF, would it reject it as an invalid UTF-8
sequence, or would it then possibly cut in the middle of the byte
sequence for a character (e.g., 0xEF 0xBF 0x00)?)

> > > When I write a basic little perl script that reads in lines from a
> > > file, does trivial string operations on them, then prints them back
> > > out, there should be absolutely no need for my code to make any
> > > special considerations for encoding.
> >
> > It depends how trivial the operations are.
> >
> > (Offhand, the only things I think would be safe are copying and
> > appending.)
>
> This is because you don't understand UTF-8..

Bull.  Try providing some real information (a couple of counterexamples).

Daniel
--
Daniel Barclay
[EMAIL PROTECTED]

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
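
To make the delimiter question above concrete, here is a minimal sketch in Python (not Perl, and not how "cut" is actually implemented) that splits a UTF-8 byte string at the byte level and checks whether every piece is still valid UTF-8. The helper names is_valid_utf8 and split_is_safe and the sample string are made up for illustration. It compares three delimiters: an ASCII byte, the full byte sequence of a multibyte character, and the lone continuation byte 0xBF.

```python
def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def split_is_safe(data: bytes, delim: bytes) -> bool:
    """Split at the byte level and report whether every piece is valid UTF-8."""
    return all(is_valid_utf8(piece) for piece in data.split(delim))


# Hypothetical sample: ASCII, two-byte, and three-byte characters.
# U+FFFD encodes as 0xEF 0xBF 0xBD, so it contains the byte 0xBF.
sample = "naïve:日本語:\ufffd".encode("utf-8")

# An ASCII delimiter never occurs inside a multibyte sequence.
ascii_safe = split_is_safe(sample, b":")

# The complete encoding of one character only matches at character boundaries.
whole_char_safe = split_is_safe(sample, "語".encode("utf-8"))

# A lone continuation byte matches in the middle of U+FFFD and breaks it.
partial_safe = split_is_safe(sample, b"\xbf")

print(ascii_safe, whole_char_safe, partial_safe)
```

In this sketch the first two splits leave only valid pieces, while splitting on the bare byte 0xBF cuts U+FFFD apart. That suggests byte-level splitting is safe when the delimiter is a whole character, and that a tool which accepts an arbitrary byte as delimiter without validating it could cut mid-character.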
