007/3/27, Daniel B. <[EMAIL PROTECTED]>:
What about when it breaks a string into substrings at some delimiter,
say, using a regular expression?  It has to break the underlying byte
string at a character boundary.


Unless you pass invalid UTF-8 sequences to your regular expression library, that
should be impossible. Breaking strings works fine as long as you
pattern-match for the boundaries.

The only time it fails is if you break the string at arbitrary byte
indexes. Note that breaking UTF-32 strings at arbitrary indices also
destroys the text (for example, by separating a combining mark from its
base character).
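
To make the difference concrete, here is a small sketch in Python. The sample
text and the helper name decodes_cleanly are made up for illustration, and
Python's re module in bytes mode stands in for "a regular expression library"
(in that mode \s matches ASCII whitespace only):

```python
import re

text = "naïve, café ,日本語"
data = text.encode("utf-8")

# Split the raw UTF-8 bytes at a pattern-matched delimiter.
parts = re.split(rb"\s*,\s*", data)
decoded = [p.decode("utf-8") for p in parts]  # every piece is still valid UTF-8

def decodes_cleanly(chunk: bytes) -> bool:
    """Hypothetical helper: True if chunk is a valid UTF-8 sequence."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# An arbitrary byte index can land inside a multi-byte character:
# "ï" is two bytes, and this slice keeps only the first of them.
broken = data[:3]

# Python str indexes by code point, much like UTF-32. Cutting between
# "e" and U+0301 (combining acute accent) leaves a bare "e" and a stray mark.
combined = "e\u0301"
base, mark = combined[:1], combined[1:]
```

The pieces in parts all decode, while broken does not. The last two lines
show that fixed-width code units do not make arbitrary cuts safe either.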

In fact, what about interpreting an underlying string of bytes as
the right individual characters in that regular expression?

The regular expression engine should be UTF-8 aware. The code that
uses and calls it has no need to be.

Any time a program uses the underlying byte string as a character
string other than simply a whole string (e.g., breaking it apart,
interpreting it), it needs to consider it at the character level,
not the byte level.

Only the fanciest interpretations require any knowledge of Unicode
code points. Any substring match on valid sequences will produce valid
boundaries in UTF-8, and that's the whole point.
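
As a rough illustration of that last claim, the sketch below searches raw
UTF-8 bytes for an encoded substring and cuts at the match. The sample
strings are invented; the only assumption is that both haystack and needle
are valid UTF-8:

```python
haystack = "prix: 10€ ou 12€".encode("utf-8")
needle = "€".encode("utf-8")  # three bytes in UTF-8

# Plain byte-level search, with no knowledge of code points.
idx = haystack.find(needle)

# Cut around the match; both sides are still valid UTF-8.
before = haystack[:idx]
after = haystack[idx + len(needle):]

left = before.decode("utf-8")
right = after.decode("utf-8")
```

This works because a lead byte in UTF-8 can never be mistaken for a
continuation byte, so a valid needle cannot match starting in the middle
of a character in a valid haystack.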
