On Tue, Mar 27, 2007 at 11:53:15PM -0400, SrinTuar wrote:
> 2007/3/27, Daniel B. <[EMAIL PROTECTED]>:
> >What about when it breaks a string into substrings at some delimiter,
> >say, using a regular expression?  It has to break the underlying byte
> >string at a character boundary.
> 
> 
> Unless you pass invalid utf-8 
> sequences to your regular 

Haha, was it your intent to use this huge Japanese fullwidth ASCII? :)
Sadly I don't think Daniel can read anything but Latin-1...
Here's an ASCII transliteration...
~Rich


On Tue, Mar 27, 2007 at 11:53:15PM -0400, SrinTuar wrote:
> 2007/3/27, Daniel B. <[EMAIL PROTECTED]>:
> >What about when it breaks a string into substrings at some delimiter,
> >say, using a regular expression?  It has to break the underlying byte
> >string at a character boundary.
> 
> Unless you pass invalid UTF-8 sequences to your regular expression
> library, that should be impossible. Breaking strings works great as
> long as you pattern match for boundaries.
> 
> The only time it fails is if you break it at arbitrary byte
> indexes. Note that breaking UTF-32 strings at arbitrary indices also
> destroys the text.
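
To make the failure case concrete, here is a minimal Python sketch. It is not from the original message; the sample strings and the helper name `is_valid_utf8` are illustrative assumptions. It cuts a UTF-8 byte string at an arbitrary byte index, then shows that a string indexed by code point (much like UTF-32) can still be damaged by separating a combining mark from its base letter.

```python
import unicodedata


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "ï" occupies bytes 2 and 3 (0xC3 0xAF) of the encoded string.
encoded = "naïve".encode("utf-8")

# Cutting at byte index 3 lands inside that two-byte sequence.
head, tail = encoded[:3], encoded[3:]
head_ok = is_valid_utf8(head)  # head ends with a dangling lead byte
tail_ok = is_valid_utf8(tail)  # tail starts with a continuation byte

# Python str is indexed by code point, much like a UTF-32 array.
# "e" followed by U+0301 COMBINING ACUTE ACCENT displays as a single "é".
decomposed = "e\u0301"
base, mark = decomposed[:1], decomposed[1:]
mark_is_combining = unicodedata.combining(mark) != 0
```

Both halves of the byte-level cut fail to decode. The code-point cut yields two well-formed strings, but the accent has been stripped off its letter, so the text is damaged all the same.
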
> 
> >In fact, what about interpreting an underlying string of bytes as
> >the right individual characters in that regular expression?
> 
> The regular expression engine should be UTF-8 aware. The code that
> uses and calls it has no need to be.
> 
> >Any time a program uses the underlying byte string as a character
> >string other than simply a whole string (e.g., breaking it apart,
> >interpreting it), it needs to consider it at the character level,
> >not the byte level.
> 
> Only the fanciest interpretations require any knowledge of Unicode
> code points. Any substring match on valid sequences will produce valid
> boundaries in UTF-8, and that's the whole point.
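
The substring claim can be checked with a small Python sketch. This is an illustration under assumptions, not code from the thread: `split_utf8` and the sample data are hypothetical. It splits raw bytes at an encoded delimiter without ever decoding them. It relies on UTF-8 being self-synchronizing: lead bytes and continuation bytes come from disjoint ranges, so the encoding of one valid character cannot match at an offset inside another.

```python
import re


def split_utf8(data: bytes, delimiter: str) -> list:
    """Split UTF-8 bytes at every occurrence of delimiter (hypothetical helper)."""
    return data.split(delimiter.encode("utf-8"))


# The delimiter is itself a three-byte character (U+3001).
cities = "東京、大阪、京都".encode("utf-8")
parts = split_utf8(cities, "、")

# Every piece is still valid UTF-8, although the split never decoded anything.
decoded = [p.decode("utf-8") for p in parts]

# A byte-mode regular expression with ASCII delimiters behaves the same way.
fields = re.split(rb"[,;]", "α,β;γ".encode("utf-8"))
decoded_fields = [f.decode("utf-8") for f in fields]
```

The calling code only handles bytes here. Anything that needs per-character semantics, such as a character class containing non-ASCII characters, would still need an engine that understands UTF-8.
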

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
