On Wed, 2007-03-28 at 23:59 -0700, Erick Tryzelaar wrote:
> skaller wrote:
> > Ops like trim/strip will simply miss some whitespaces, they won't
> > do the wrong thing provided they treat high bit set chars as non-space.
> >   
> 
> That's only if you're using trim/strip to remove whitespace. You can 
> also use them to trim substrings, or strip all the chars in a string. 
> Those won't work on unicode.

I think they will -- UTF8 is designed to allow that.

That is: if you want to, say, remove a substring S from T,
then searching T for S will never find a wrong match --
unless S is itself not a legal UTF-8 string.

For example, if you search T for S = 0x89 or something,
then you will certainly mess up .. but 0x89 on its own isn't a legal
UTF-8 character (it can only be a continuation byte), so what do you expect?
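
As a rough illustration of the point, here is a small Python 3 sketch.
The helper names find_all and is_char_boundary are made up for this
example, and the sample text is arbitrary. It searches the raw UTF-8
bytes for a legal needle and for the lone byte 0x89, then checks whether
each hit lands on a character boundary.

```python
def find_all(haystack: bytes, needle: bytes) -> list:
    """Return every byte offset at which needle occurs in haystack."""
    hits, start = [], 0
    while True:
        i = haystack.find(needle, start)
        if i == -1:
            return hits
        hits.append(i)
        start = i + 1

def is_char_boundary(data: bytes, i: int) -> bool:
    """True if offset i is the end of data or is not a continuation byte (10xxxxxx)."""
    return i == len(data) or (data[i] & 0xC0) != 0x80

text = "café CAFÉ".encode("utf-8")   # 'é' is C3 A9, 'É' is C3 89

# A legal needle: every hit starts and ends on a character boundary.
needle = "é".encode("utf-8")
good_hits = find_all(text, needle)
good_aligned = all(
    is_char_boundary(text, i) and is_char_boundary(text, i + len(needle))
    for i in good_hits
)

# An illegal needle (a bare continuation byte): the hit is mid-character.
bad_hits = find_all(text, b"\x89")
bad_aligned = all(is_char_boundary(text, i) for i in bad_hits)
```

Note that 'é' and 'É' share the lead byte C3, yet the search for 'é'
does not match inside 'É', because the whole two-byte needle has to match.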

Look at the encoding rules: I claim the following
rule holds: given any two unicode characters
with UTF8 encodings (for example): 

        c0 c1 c2 c3 d0 d1 d2 d3

then if any contiguous subsequence of the above string is a legal UTF8 char,
it must be c0 c1 c2 c3 or d0 d1 d2 d3 .. no other contiguous subsequence
of the above string is a legal encoding.
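
The claim can be spot-checked by brute force. The sketch below (Python 3,
with a hypothetical helper decodes_to_one_char and an arbitrary pair of
characters) tries every contiguous slice of two encoded characters and
keeps the ones that a strict UTF-8 decoder accepts as exactly one
character. It checks one example pair only; it is not a proof.

```python
def decodes_to_one_char(chunk: bytes) -> bool:
    """True if chunk is a complete, valid UTF-8 encoding of exactly one character."""
    try:
        return len(chunk.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False

# A 3-byte character followed by a 4-byte character:
# U+20AC is E2 82 AC, U+1D11E is F0 9D 84 9E.
pair = "\u20ac\U0001d11e".encode("utf-8")

# Collect every (start, end) slice that is a legal single character.
legal = [
    (i, j)
    for i in range(len(pair))
    for j in range(i + 1, len(pair) + 1)
    if decodes_to_one_char(pair[i:j])
]
```

Only the two slices that coincide with the original characters survive.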

A simpler invariant is: there is no UTF8 encoding starting
with c1, c2, or c3; indeed there is no legal UTF8 encoding in which
any byte is out of sync. If you hit a byte

        0xXX

then XX tells you whether it is the first byte of a sequence or a
continuation byte.

So given any pointer into a UTF-8 string you can find the
first byte of the sequence it is in the middle of: the value ranges
of lead bytes and continuation bytes are disjoint, so you just back
up until you hit a byte that is not a continuation byte.
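
A minimal sketch of that resynchronisation step in Python 3, assuming
the input is valid UTF-8 (the function name char_start is invented for
this example). Continuation bytes always have the bit pattern 10xxxxxx,
so the loop steps backwards while it sees that pattern.

```python
def char_start(data: bytes, i: int) -> int:
    """Back up from offset i to the first byte of the character containing it.

    Assumes data is valid UTF-8; continuation bytes match 10xxxxxx.
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

# 'a' is 61, U+20AC is E2 82 AC, 'b' is 62.
data = "a\u20acb".encode("utf-8")

# For every byte offset, find where its character begins.
starts = [char_start(data, i) for i in range(len(data))]
```

Offsets 1, 2 and 3 all map back to offset 1, the lead byte of the
three-byte character.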

So roughly speaking, ANY 'nice' stream operation on 8 bit chars will
also work for UTF-8 and preserve semantics (character meaning).

I won't define 'nice' here, but it includes not only searching
but also sorting.
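
For sorting, the relevant property is that comparing UTF-8 strings byte
by byte gives the same order as comparing them code point by code point.
The Python 3 sketch below checks this on an arbitrary word list; it says
nothing about locale-aware collation, which is a different question.

```python
words = ["zebra", "\u00e9coute", "apple", "\u65e5\u672c", "\U0001d11e clef", "Zoo"]

# Order by Unicode code point values.
by_code_point = sorted(words, key=lambda w: [ord(c) for c in w])

# Order by the raw UTF-8 bytes, as an 8-bit string sort would.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

same_order = by_code_point == by_bytes
```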


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

_______________________________________________
Felix-language mailing list
Felix-language@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/felix-language