On 28.05.14 22:24, Andres Valloud wrote:
Hey Philippe,
Yes but #= is blissfully unaware of normalization in Squeak/Pharo. In
fact AFAIK Squeak/Pharo is unaware of normalization. Having had a short
look, it doesn't even look as if case insensitivity works in Squeak/Pharo
outside of Latin-1 (I could be wrong though).
Yes, that's what I am thinking about. To be more explicit, suppose
"Unicode" series of characters got into the image via the keyboard, a
file, a socket... once decoded, what could one do with them? Are all
types of decoded character series going to be represented as instances
of a single class, although they have inherently different behavior?
I don't understand the question.
In addition you probably don't want #= to do normalization "because
performance". And even if you did you probably still want a fast path
for ByteString receiver and ByteString argument in which case #size is
safe.
Assuming all fixed width representation strings (e.g. byte strings) will
always have the same encoding (e.g. same code page), then the size check
for those seems ok to me.
All Strings are fixed width in Pharo/Squeak. If you have a single
non-Latin1 character (code point) in a String all characters in the
String are promoted from a byte to an OOP. #size then answers the number
of OOPs instead of the number of bytes. So #size always answers the
number of characters (code points) non-normalized (because there is no
way to do normalization in Pharo/Squeak).
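
To make the #size point concrete, here is a minimal sketch in Python, used only because its standard library includes a normalizer (unicodedata); the variable names are made up for illustration. It assumes that len() on a Python str, which counts code points, is a fair stand-in for #size on a Pharo/Squeak String. The composed and decomposed spellings of the same text differ in size, so a size check is a safe shortcut only for a comparison that does not normalize.

```python
import unicodedata

composed = "\u00e9"      # 'e with acute' as a single code point (U+00E9)
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# len() counts code points, assumed here to mirror #size on a String.
sizes = (len(composed), len(decomposed))

# A plain comparison does not normalize, so differing sizes mean "not equal".
plain_equal = composed == decomposed

# A normalizing comparison can answer true even though the sizes differ,
# which is why a size-based fast path would be wrong for it.
normalized_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(sizes, plain_equal, normalized_equal)
```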
Just to make sure, I am not celebrating all this complexity in the
world... however, given that it's there, how are we going to deal with
it? I'm concerned about the long term consequences of making things
more complex than they are by reinterpreting them. The problem I see is
that ultimately programs just won't Work(TM).
This seems to me a more general discussion than the problem at hand.
Again at the moment Pharo/Squeak is largely unaware of Unicode. It
supports very large code points but without any semantics (similar to
let's say Erlang, which doesn't even have a String type). What
Pharo/Squeak however does know about is encodings for mapping the
Strings to and from bytes.
And quite honestly doing normalization in #= may cause things to just
not Work(TM). Consider for example an HTTP request with a query string
with two different query fields that are normalized equal. Would you
want the values to be stored under one dictionary key?
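
A rough sketch of that query string scenario, again in Python and under stated assumptions: the query string is hypothetical, and a dict whose keys are normalized on insertion stands in for a Dictionary whose #= normalizes. Both field names render identically, one composed and one decomposed.

```python
import unicodedata
from urllib.parse import parse_qsl

# Hypothetical query string with two field names that look the same:
# the first uses U+00E9 (%C3%A9), the second 'e' + U+0301 (e%CC%81).
query = "caf%C3%A9=1&cafe%CC%81=2"

pairs = parse_qsl(query)  # percent-decodes as UTF-8 by default

# Without normalization the two fields stay distinct keys.
plain = dict(pairs)

# With keys normalized to NFC the fields collapse and one value is lost.
normalized = {unicodedata.normalize("NFC", key): value for key, value in pairs}

print(len(plain), len(normalized))
```

With plain comparison both fields survive; once the keys are normalized the second value silently overwrites the first.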
Cheers
Philippe