Rich Felker wrote:
> 
> On Tue, Mar 27, 2007 at 10:07:11PM -0400, Daniel B. wrote:
...
> >
> > What about when it breaks a string into substrings at some delimiter,
> > say, using a regular expression?  It has to break the underlying byte
> > string at a character boundary.
> 
> Searching for the delimiter already gives you a character boundary.
> There is no need to think further about it.

As long as you specified the delimiter properly (a whole character,
not a partial byte sequence).
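
To make that condition concrete, here is a minimal Python sketch (not
from the thread; the sample string and the helper name
split_is_valid_utf8 are made up for illustration). It splits a UTF-8
byte string at the byte level and checks whether every piece still
decodes. Under these assumptions, a delimiter that is a whole encoded
character leaves only valid pieces, while a lone continuation byte
does not.

```python
def split_is_valid_utf8(data: bytes, delim: bytes) -> bool:
    """Return True if every piece of data.split(delim) is valid UTF-8."""
    for piece in data.split(delim):
        try:
            piece.decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


# Illustrative sample text, encoded to raw bytes.
text = "naïve;café;日本語".encode("utf-8")

# Delimiter is a whole one-byte character (";" is 0x3B).
whole = split_is_valid_utf8(text, ";".encode("utf-8"))

# Delimiter is a whole two-byte character ("é" is 0xC3 0xA9).
multi = split_is_valid_utf8(text, "é".encode("utf-8"))

# Delimiter is only the second byte of "é": a partial byte sequence.
partial = split_is_valid_utf8(text, b"\xa9")

print(whole, multi, partial)
```

The first two splits leave only valid pieces; the third leaves a
dangling 0xC3 lead byte at the end of one piece.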


> For example, the unix "cut" program works automatically with UTF-8
> text as long as the delimiter is a single byte, 

By "single byte," do you mean a character whose UTF-8 representation
is a single byte?  If you gave it the byte 0xBF, would it reject that
as an invalid UTF-8 sequence, or could it cut in the middle of the
byte sequence for a character (e.g., 0xEF 0xBF 0xBD)?
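
The two possible outcomes in that question can be sketched in Python.
This is only an illustration of a purely byte-oriented splitter, not a
statement about what any particular "cut" implementation does; the
helper name is_valid_utf8 and the sample line are assumptions. The
sample uses U+FFFD, whose UTF-8 encoding is 0xEF 0xBF 0xBD, as a
character that contains the byte 0xBF.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


delim = b"\xbf"                    # a lone continuation byte
line = "a\ufffdb".encode("utf-8")  # bytes 61 EF BF BD 62

# Outcome 1: a tool that validates its delimiter would reject 0xBF.
delim_ok = is_valid_utf8(delim)

# Outcome 2: a tool that splits on raw bytes cuts inside the character.
fields = line.split(delim)
fields_ok = [is_valid_utf8(f) for f in fields]

print(delim_ok, fields, fields_ok)
```

Here the byte-level split yields two fields, and neither one is valid
UTF-8 on its own.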
 


> > > When I write a basic little perl script that reads in lines from a
> > > file, does trivial string operations on them, then prints them back
> > > out, there should be absolutely no need for my code to make any
> > > special considerations for encoding.
> >
> > It depends how trivial the operations are.
> >
> > (Offhand, the only things I think would be safe are copying and
> > appending.)
> 
> This is because you don't understand UTF-8.

Bull.  Try providing some real information (a couple of counterexamples).


Daniel
-- 
Daniel Barclay
[EMAIL PROTECTED]

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
