(Sorry for sending this twice, Marcin.)

"Marcin 'Qrczak' Kowalczyk" writes: 
> UTF-8 is poorly suitable for internal processing of strings in a 
> modern programming language (i.e. one which doesn't already have a 
> pile of legacy functions working of bytes, but which can be designed 
> to make Unicode convenient at all). It's because code points have 
> variable lengths in bytes, so extracting individual characters is 
> almost meaningless (unless you care only about the ASCII subset, and 
> sequences of all other characters are treated as non-interpreted bags 
> of bytes). You can't even have a correct equivalent of C isspace(). 
 
That's assuming the programming language is similar to C or Ada. 
If you're talking about a language that hides the structure of strings 
and has no problem with variable-length data, then it wouldn't matter 
what the internal representation of the string looks like. You'd need to 
use iterators and discourage arbitrary indexing, but arbitrary 
indexing is rarely important. 
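As a minimal sketch of the point (in Python, which happens to hide its string representation this way): once the language decodes bytes into an opaque sequence of code points, a correct isspace() equivalent is unproblematic, even though the underlying UTF-8 storage is variable-length.

```python
def count_spaces(utf8_bytes: bytes) -> int:
    # Decode once at the boundary; after that we iterate over code
    # points with an iterator, never over raw bytes or byte offsets.
    text = utf8_bytes.decode("utf-8")
    return sum(1 for ch in text if ch.isspace())

# U+00A0 NO-BREAK SPACE is two bytes in UTF-8 but a single code point,
# and str.isspace() classifies it correctly anyway.
data = "a\u00a0b c".encode("utf-8")
print(count_spaces(data))  # -> 2
```

The byte length of each character never leaks into the classification logic, which is the whole argument for hiding the representation.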
 
You could also hide combining characters, which would be extremely useful if 
we were only using the Latin and Cyrillic scripts. You'd have to be flexible, 
though, since it would be natural to step through a Hebrew or Arabic string as 
if the vowels were written inline, and people might occasionally want to look 
at the combining characters themselves (which would be rare if your language 
already provided most of the standard Unicode functions). 
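To illustrate what "hiding combining characters" could look like, here is a simplified sketch in Python: an iterator that attaches trailing combining marks to their base character using unicodedata.combining(). This is deliberately naive (real cluster boundaries are defined by Unicode's grapheme segmentation rules, UAX #29), but it shows the iterator-based shape such an API might take.

```python
import unicodedata

def iter_clusters(text: str):
    """Yield a base character together with any trailing combining
    marks. A simplified sketch, not full UAX #29 segmentation."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch):
            cluster += ch          # attach the mark to its base
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

# "e" followed by U+0301 COMBINING ACUTE ACCENT comes back as one unit.
print(list(iter_clusters("e\u0301x")))  # -> ['e\u0301', 'x']
```

A flexible design could expose both views: this clustered iterator for users who want whole characters, and a plain code-point iterator for the rare cases where someone needs to inspect the marks themselves.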
 