Haha, was it your intent to use this huge Japanese wide-ASCII? :)
Sadly, I don't think Daniel can read anything but Latin-1...
Here's an ASCII transliteration...

~Rich

On Tue, Mar 27, 2007 at 11:53:15PM -0400, SrinTuar wrote:
> 2007/3/27, Daniel B. <[EMAIL PROTECTED]>:
> >What about when it breaks a string into substrings at some delimiter,
> >say, using a regular expression? It has to break the underlying byte
> >string at a character boundary.
>
> Unless you pass invalid utf-8 sequences to your regular expression
> library, that should be impossible. Breaking strings works great as
> long as you pattern match for boundaries.
>
> The only time it fails is if you break it at arbitrary byte indexes.
> Note that breaking utf-32 strings at arbitrary indices also destroys
> the text.
>
> >In fact, what about interpreting an underlying string of bytes as
> >the right individual characters in that regular expression?
>
> The regular expression engine should be utf-8 aware. The code that
> uses and calls it has no need to be.
>
> >Any time a program uses the underlying byte string as a character
> >string other than simply a whole string (e.g., breaking it apart,
> >interpreting it), it needs to consider it at the character level,
> >not the byte level.
>
> Only the fanciest interpretations require any knowledge of Unicode
> code points. Any substring match on valid sequences will produce valid
> boundaries in utf-8, and that's the whole point.

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
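[Editor's note: a minimal sketch, not from the thread, illustrating SrinTuar's point. Because UTF-8 is self-synchronizing, splitting a valid byte string at a matched delimiter always yields valid UTF-8 pieces, while cutting at an arbitrary byte index can land inside a multibyte character. The sample strings here are arbitrary.]

```python
# Valid UTF-8 bytes containing a multibyte character ('ï' = b'\xc3\xaf').
data = "naïve,café".encode("utf-8")

# Splitting at a delimiter found by byte-level search: every piece is
# still valid UTF-8, because the delimiter's bytes cannot occur inside
# the encoding of any other character.
pieces = [p.decode("utf-8") for p in data.split(b",")]
print(pieces)  # ['naïve', 'café']

# Cutting at an arbitrary byte index can split a character in half:
# data[:3] is b'na\xc3', which ends mid-sequence and fails to decode.
try:
    data[:3].decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid boundary:", e.reason)
```

The same property is what lets a UTF-8-aware regex engine hand back byte offsets that callers can slice at blindly: any offset it reports for a match on valid input is necessarily a character boundary.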
