EuanM wrote
> ...
>         all ISO-8859-1 maps 1:1 to Unicode UTF-8
> ...

I am late coming in to this conversation. If it hasn't already been said,
please do not conflate Unicode and UTF-8. I think that would be a recipe for
a high P.I.T.A. factor.

Unicode defines the code points and their meaning.
UTF-8 (and UTF-16) define how those code points are serialized as bytes for
interchange.
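
To make the distinction concrete, here is a minimal sketch in Python (used only for illustration, not Smalltalk). It takes one character and shows its code point next to its ISO-8859-1 and UTF-8 byte forms. The variable names are mine.

```python
ch = "é"  # one character, one code point

code_point = ord(ch)                  # the Unicode code point, U+00E9
latin1_bytes = ch.encode("latin-1")   # ISO-8859-1: a single byte, 0xE9
utf8_bytes = ch.encode("utf-8")       # UTF-8: two bytes, 0xC3 0xA9

print(hex(code_point))     # 0xe9
print(latin1_bytes.hex())  # e9
print(utf8_bytes.hex())    # c3a9
```

So the ISO-8859-1 byte value matches the Unicode code point 1:1, but it does not match the UTF-8 bytes once you go above 0x7F.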

In other words, when you write the code points to an external medium
(socket, file, whatever), encode them via UTF-whatever. Read UTF-whatever
from an external medium and re-instantiate the code points.
(Personally, I see no use for UTF-16 as an interchange mechanism. Others may
have justification for it. I don't.)
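
A rough sketch of that boundary, again in Python for illustration only. The helper names write_text and read_text are hypothetical, and an in-memory io.BytesIO stands in for the socket or file.

```python
import io


def write_text(medium, text):
    """Encode code points to UTF-8 only at the point of writing."""
    medium.write(text.encode("utf-8"))


def read_text(medium):
    """Decode UTF-8 bytes back into code points when reading."""
    return medium.read().decode("utf-8")


medium = io.BytesIO()          # stand-in for a file or socket
original = "naïve café"        # 10 code points
write_text(medium, original)

medium.seek(0)
restored = read_text(medium)   # back to code points; bytes never leak inward
```

Inside the image everything deals in code points. The byte encoding exists only at the edge.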

Having characters be a consistent size in their object representation makes
everything easier. #at:, #indexOf:, #includes: ... no one wants to be
scanning through bytes representing variable-sized characters.
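
To show what that scanning looks like, here is a sketch (Python, for illustration) of an #at:-style lookup done directly on UTF-8 bytes. The function name code_point_at_utf8 is made up, the index is 0-based, and it assumes the input is valid UTF-8.

```python
def code_point_at_utf8(data: bytes, index: int) -> str:
    """Return the index-th character by scanning UTF-8 bytes from the start."""
    seen = -1      # index of the most recent character start we have passed
    start = None   # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: new character
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:      # next character begins, so ours ended
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")  # wanted character was the last one


text = "añb€c"
data = text.encode("utf-8")

fixed_width = text[3]                       # direct index into code points
variable_width = code_point_at_utf8(data, 3)  # linear scan over the bytes
```

Both give the same character, but the second has to walk every byte before it, every time.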

Model Unicode strings using classes such as Unicode7, Unicode16, and
Unicode32, with automatic coercion to the larger character width.
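
One way the coercion could look, sketched in Python for illustration. This is not an existing Pharo or Python API: FixedWidthString and width_for are hypothetical names, and a single class with a width attribute stands in for the three separate classes.

```python
def width_for(text: str) -> int:
    """Smallest width (7, 16 or 32 bits) that fits every code point in text."""
    top = max(map(ord, text), default=0)
    if top < 0x80:
        return 7
    if top < 0x10000:
        return 16
    return 32


class FixedWidthString:
    """Hypothetical stand-in for Unicode7 / Unicode16 / Unicode32."""

    def __init__(self, text: str):
        self.code_points = tuple(map(ord, text))  # one slot per character
        self.width = width_for(text)              # bits needed per slot

    def __getitem__(self, index: int) -> str:
        return chr(self.code_points[index])       # constant-time, like #at:

    def __str__(self) -> str:
        return "".join(map(chr, self.code_points))

    def __add__(self, other: "FixedWidthString") -> "FixedWidthString":
        # The result is as wide as the wider operand: automatic coercion.
        return FixedWidthString(str(self) + str(other))


ascii_part = FixedWidthString("abc")          # width 7
bmp_part = FixedWidthString("€")              # width 16
astral_part = FixedWidthString("\U0001F600")  # width 32

mixed = ascii_part + bmp_part                 # coerced up to 16
wide = mixed + astral_part                    # coerced up to 32
```

Whatever the width, every character occupies one slot, so indexing stays a plain array access.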




--
View this message in context: 
http://forum.world.st/Unicode-Support-tp4865139p4866610.html
Sent from the Pharo Smalltalk Developers mailing list archive at Nabble.com.
