007/3/27, Daniel B. <[EMAIL PROTECTED]>:
What about when it breaks a string into substrings at some delimiter, say, using a regular expression? It has to break the underlying byte string at a character boundary.
Unless you pass invalid utf-8 sequences to your regular expression library, a split in the middle of a character should be impossible. Breaking strings works great as long as you pattern match for boundaries. The only time it fails is if you break the string at arbitrary byte indexes. Note that breaking utf-32 strings at arbitrary indices also destroys the text.
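To make that concrete, here is a minimal sketch in Python (my choice of language for illustration; the sample text and the helper name cut_is_valid are made up). It splits a utf-8 byte string on a delimiter pattern, then shows that a cut at an arbitrary byte index can land inside a multi-byte sequence.

```python
import re

# Sample text with 2-byte and 3-byte characters (illustrative only).
text = "naïve,日本語,café"
data = text.encode("utf-8")

# Split the raw bytes on a delimiter pattern. The delimiter is a complete
# utf-8 sequence, so every piece on either side is still valid utf-8.
parts = re.split(rb",", data)
decoded = [p.decode("utf-8") for p in parts]
print(decoded)  # ['naïve', '日本語', 'café']


def cut_is_valid(raw: bytes, index: int) -> bool:
    """Return True if cutting raw at index leaves two valid utf-8 halves."""
    try:
        raw[:index].decode("utf-8")
        raw[index:].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 'ï' occupies bytes 2 and 3, so index 2 is a boundary and index 3 is not.
print(cut_is_valid(data, 2))  # True
print(cut_is_valid(data, 3))  # False
```

The calling code never inspects code points here; it only avoids inventing byte offsets of its own.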
In fact, what about interpreting an underlying string of bytes as the right individual characters in that regular expression?
The regular expression engine should be utf-8 aware. The code that uses and calls it has no need to be.
Any time a program uses the underlying byte string as a character string other than simply a whole string (e.g., breaking it apart, interpreting it), it needs to consider it at the character level, not the byte level.
Only the fanciest interpretations require any knowledge of Unicode code points. Any substring match on valid sequences will produce valid boundaries in utf-8, and that's the whole point.
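A small sketch of that property, again in Python and with made-up helper names (is_char_boundary, find_all): a plain byte-level search for a valid utf-8 needle in a valid utf-8 haystack only ever reports offsets that fall on character boundaries, because a continuation byte (10xxxxxx) can never be mistaken for the first byte of a character.

```python
def is_char_boundary(raw: bytes, index: int) -> bool:
    """True if index does not point at a utf-8 continuation byte."""
    if index == 0 or index == len(raw):
        return True
    # Continuation bytes have the bit pattern 10xxxxxx.
    return (raw[index] & 0xC0) != 0x80


def find_all(haystack: bytes, needle: bytes) -> list:
    """Byte-level substring search; knows nothing about characters."""
    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets


haystack = "señor año".encode("utf-8")
needle = "ñ".encode("utf-8")

starts = find_all(haystack, needle)
print(starts)  # [2, 8]

# Both ends of every match sit on a character boundary.
for start in starts:
    end = start + len(needle)
    assert is_char_boundary(haystack, start)
    assert is_char_boundary(haystack, end)
```

Index 3, the second byte of the first 'ñ', is the kind of offset a byte-level match can never return.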
