Re: [Pharo-dev] [squeak-dev] Re: String >> #=

Philippe Marschall Fri, 30 May 2014 01:45:05 -0700

On 28.05.14 22:24, Andres Valloud wrote:

Hey Philippe,

Yes but #= is blissfully unaware of normalization in Squeak/Pharo. In
fact AFAIK Squeak/Pharo is unaware of normalization. Having had a short
look, it doesn't even look as if case insensitivity works in
Squeak/Pharo outside of Latin-1 (I could be wrong though).


Yes, that's what I am thinking about.  To be more explicit, suppose
"Unicode" series of characters got into the image via the keyboard, a
file, a socket... once decoded, what could one do with them?  Are all
types of decoded character series going to be represented as instances
of a single class, although they have inherently different behavior?


I don't understand the question.

In addition you probably don't want #= to do normalization "because
performance". And even if you did you probably still want a fast path
for ByteString receiver and ByteString argument in which case #size is
safe.


Assuming all fixed width representation strings (e.g. byte strings) will
always have the same encoding (e.g. same code page), then the size check
for those seems ok to me.

All Strings are fixed width in Pharo/Squeak. If you have a single
non-Latin1 character (code point) in a String all characters in the
String are promoted from a byte to an OOP. #size then answers the number
of OOPs instead of the number of bytes. So #size always answers the
number of characters (code points) non-normalized (because there is no
way to do normalization in Pharo/Squeak).
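As an illustration of why the size check is only safe while #= ignores
normalization, here is a minimal sketch in Python rather than Smalltalk.
It assumes Python's str, which also counts code points, as a stand-in
for a Pharo/Squeak String, and uses the standard unicodedata module for
the normalization that Pharo/Squeak lacks. The two function names are
hypothetical and exist only for this example.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

def equal_with_size_fast_path(a, b):
    """Code point equality; a differing length rejects the pair early."""
    if len(a) != len(b):
        return False
    return a == b

def equal_normalized(a, b):
    """Equality after NFC normalization; lengths cannot be compared up front."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

sizes = (len(composed), len(decomposed))
plain = equal_with_size_fast_path(composed, decomposed)
normalized = equal_normalized(composed, decomposed)
```

The two spellings differ in size, so the fast path rejects them, while
the normalizing comparison treats them as equal. A normalizing #= would
therefore have to give up the size check, at least for wide strings.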

Just to make sure, I am not celebrating all this complexity in the
world... however, given that it's there, how are we going to deal with
it?  I'm concerned about the long term consequences of making things
more complex than they are by reinterpreting them.  The problem I see is
that ultimately programs just won't Work(TM).

This seems to me a more general discussion than the problem at hand.
Again at the moment Pharo/Squeak is largely unaware of Unicode. It
supports very large code points but without any semantics (similar to
let's say Erlang, which doesn't even have a String type). What
Pharo/Squeak however does know about is encodings for mapping the
Strings to and from bytes.

And quite honestly doing normalization in #= may itself cause things to
just not Work(TM). Consider for example an HTTP request with a query
string with two different query fields that are normalized equal. Would
you want the values to be stored under one dictionary key?
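The query string case can be sketched as follows, again in Python as a
stand-in for Smalltalk. The query string and its field names are made up
for the example; parse_qsl is the standard library parser and decodes
percent-escapes as UTF-8 by default. The second dictionary only
approximates what a normalizing #= would do to dictionary keys.

```python
import unicodedata
from urllib.parse import parse_qsl

# Two fields that render the same: "café" with a precomposed é (%C3%A9)
# and "café" spelled as "e" plus a combining acute accent (%CC%81).
query = "caf%C3%A9=1&cafe%CC%81=2"
pairs = parse_qsl(query)

# Keys compared by code points, as #= does today: both fields survive.
by_code_points = dict(pairs)

# Keys compared after NFC normalization: the second field overwrites the first.
by_normalized = {unicodedata.normalize("NFC", key): value for key, value in pairs}
```

With code point comparison the dictionary keeps two entries. With
normalized keys it keeps one, and the value sent for the first field is
silently lost.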


Cheers
Philippe
