2014-05-28 22:24 GMT+02:00 Andres Valloud <[email protected]>:

> Hey Philippe,
>
>
>  Yes, but #= is blissfully unaware of normalization in Squeak/Pharo. In
>> fact, AFAIK Squeak/Pharo is unaware of normalization entirely. From a
>> quick look, it doesn't even seem as if case insensitivity works in
>> Squeak/Pharo outside of Latin-1 (I could be wrong though).
>>
>
> Yes, that's what I am thinking about.  To be more explicit, suppose a
> "Unicode" sequence of characters got into the image via the keyboard, a
> file, a socket... once decoded, what could one do with it?  Are all types
> of decoded character sequences going to be represented as instances of a
> single class, even though they have inherently different behavior?
>
>
I've got some changes pending to use Unicode for case insensitivity in
Squeak; if it's urgent, I can publish them now...
But since Eliot is working on the Character representation in Spur, I
decided to wait a little bit.
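
Just to make the idea concrete, here is a toy workspace sketch (not the
pending changes; the one-entry folding table is made up for illustration,
a real one would be generated from the Unicode data files):

| fold foldBlock |
fold := Dictionary new.
fold at: 16r00C4 put: 16r00E4.  "Ä -> ä; a real table would cover all of Unicode"
foldBlock := [:aString |
    aString collect: [:c |
        Character value: (fold at: c asInteger ifAbsent: [c asLowercase asInteger])]].
(foldBlock value: (String with: (Character value: 16r00C4) with: $B))
    = (foldBlock value: (String with: (Character value: 16r00E4) with: $b))
"answers true with the toy table"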

Normalization is really a complex beast. There was an attempt by Yoshiki
(CombinedChar) to correctly render the simplest composition sequences in
the now-dead MultiCharacter* classes.
We decided to remove this incomplete support and postponed the rewrite for
later...
(since we would need a canonical representation in other places, it's
better to separate it out...)

The CombinedChar utility is still there but unused in rendering.
It is used in UnicodeCompositionStream, which is in turn used in
ParagraphEditor>>readKeyboard (yes, the ST80 one!).

Pharo 3.0 has a UTF8DecomposedTextConverter, but that is far less general;
it uses CombinedChar and is unused in the base image.
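
For readers who have not looked at composition before, here is a toy sketch
of the idea over raw code points (nothing to do with the actual CombinedChar
code; the one-entry table is made up, and a real implementation per UAX #15
also has to honour combining classes and canonical ordering):

| table input output i pair |
table := Dictionary new.
table at: #(16r0065 16r0301) put: 16r00E9.  "e + COMBINING ACUTE ACCENT -> é"
input := #(16r0074 16r0065 16r0301).        "t, e, combining acute"
output := OrderedCollection new.
i := 1.
[i <= input size] whileTrue: [
    pair := i < input size ifTrue: [{input at: i. input at: i + 1}] ifFalse: [nil].
    (pair notNil and: [table includesKey: pair])
        ifTrue: [output add: (table at: pair). i := i + 2]
        ifFalse: [output add: (input at: i). i := i + 1]].
output asArray  "#(116 233), i.e. $t followed by precomposed é"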


>
>  In addition, you probably don't want #= to do normalization "because
>> performance". And even if you did, you probably would still want a fast
>> path for a ByteString receiver and a ByteString argument, in which case
>> #size is safe.
>>
>
> Assuming all fixed-width representation strings (e.g. byte strings) will
> always have the same encoding (e.g. the same code page), the size check
> for those seems ok to me.
>
> Just to make sure: I am not celebrating all this complexity in the
> world... however, given that it's there, how are we going to deal with it?
> I'm concerned about the long-term consequences of making things more
> complex than they are by reinterpreting them.  The problem I see is that
> ultimately programs just won't Work(TM).
>
> Andres.
>
>
Sure, Unicode is aggregating centuries of history from the whole world; how
could it be simple...
Our support is really the bare minimum.
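
As for the ByteString fast path mentioned above, something along these lines
would do (a sketch only, with a hypothetical selector, not the actual
String>>= code):

String >> equalsFast: aString
    "Hypothetical illustration: when both sides are ByteStrings, a #size
     mismatch settles the comparison before any per-character work."
    (self class == ByteString and: [aString class == ByteString]) ifTrue: [
        self size = aString size ifFalse: [^false].
        1 to: self size do: [:i |
            (self at: i) = (aString at: i) ifFalse: [^false]].
        ^true].
    ^self = aString  "fall back to the general (normalization-unaware) comparison"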
