On Wed, 2007-03-28 at 23:59 -0700, Erick Tryzelaar wrote:
> skaller wrote:
> > Ops like trim/strip will simply miss some whitespaces, they won't
> > do the wrong thing provided they treat high bit set chars as non-space.
>
> That's only if you're using trim/strip to remove whitespace. You can
> also use them to trim substrings, or strip all the chars in a string.
> Those won't work on unicode.

I think they will -- UTF-8 is designed to allow that.

That is: if you want to remove a substring S from T, then searching T
for S will never find a wrong substring, unless S is itself an illegal
string. For example, if you search T for S = 0x89 or something, then you
will certainly mess up .. but 0x89 on its own isn't a legal UTF-8
character, so what do you expect?

Look at the encoding rules. I claim the following rule holds: given any
two Unicode characters whose UTF-8 encodings sit next to each other,
for example

    c0 c1 c2 c3 d0 d1 d2 d3

then if any contiguous substring of the above string is a legal UTF-8
char, it must be c0 c1 c2 c3 or d0 d1 d2 d3. No other substring of the
above string is a legal encoding.

A simpler invariant is: there is no UTF-8 encoding starting with c1, c2,
or c3; indeed there is no UTF-8 encoding in which any byte is out of
sync. If you hit a byte 0xXX, then XX tells you whether it is the first
byte of a sequence or a continuation byte. So given any pointer into a
UTF-8 string you can find the first byte of the sequence it is in the
middle of: the ranges of lead bytes and continuation bytes are disjoint.

So roughly speaking, ANY 'nice' stream operation on 8-bit chars will
also work for UTF-8 and preserve semantics (character meaning). I won't
define 'nice' here, but it includes not only searching but also sorting.

-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

_______________________________________________
Felix-language mailing list
Felix-language@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/felix-language
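
A minimal sketch of the resynchronization and search argument, in Python
rather than Felix. It is an illustration only, not part of the original
message: the names is_continuation, sequence_start and find_char_aligned
are hypothetical, and it assumes both inputs are valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF);
    # lead bytes and ASCII bytes never fall in that range.
    return (byte & 0xC0) == 0x80


def sequence_start(buf: bytes, i: int) -> int:
    # Back up from index i to the first byte of the sequence containing it.
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i


def find_char_aligned(haystack: bytes, needle: bytes) -> int:
    # A plain byte search. Assuming both arguments are valid UTF-8, any hit
    # begins on a lead byte, so it is never in the middle of a character.
    pos = haystack.find(needle)
    assert pos < 0 or sequence_start(haystack, pos) == pos
    return pos


text = "naïve café €5".encode("utf-8")
euro_at = find_char_aligned(text, "€".encode("utf-8"))

# Byte-wise ordering of UTF-8 strings agrees with code point ordering.
words = ["zebra", "éclair", "apple", "€uro", "𝄞clef"]
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
```

Searching for a lone continuation byte such as b"\x89" violates the
valid-UTF-8 assumption, which is the "illegal substring" case: the
assertion in find_char_aligned would fire on a hit.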