Re: [Pharo-project] squeakToUTF-8 and related?

Nicolas Cellier Sun, 28 Mar 2010 10:31:04 -0700

2010/3/28 Stéphane Ducasse <[email protected]>:
>
>> You should ask Sophie team, their knowledge certainly is far more
>> advanced than mine.
>
> The problem is that most of them disappeared after the Java rewrite announcement.
>
>> String should be a SequenceableCollection of Character.
>> Internally, for space/speed reasons they rather store a code
>> representing the value of a Character.
>>
>> In a simple model, this value would be the Unicode encoding...
>> In Squeak, only the lowest 22 bits of a Character value are used to
>> encode the character (#charCode).
>> Bits of rank 23 to 30 encode a so-called #leadingChar.
>> I guess we stopped at bit #30 just to be sure to handle SmallInteger values.
>> Don't count on me to explain leadingChar, I can't...
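
To make the bit layout above concrete, here is a rough sketch in Python rather than Smalltalk, only so it is easy to run. The function and constant names are made up for illustration; this is not the Squeak implementation, just the arithmetic implied by "22 low bits for charCode, bits of rank 23 to 30 for leadingChar".

```python
# Hypothetical names; only the bit widths come from the description above.
CHAR_CODE_BITS = 22
CHAR_CODE_MASK = (1 << CHAR_CODE_BITS) - 1   # lowest 22 bits
LEADING_CHAR_MASK = 0xFF                     # 8 bits, ranks 23 to 30


def split_value(value):
    """Return (leading_char, char_code) for a raw character value."""
    leading_char = (value >> CHAR_CODE_BITS) & LEADING_CHAR_MASK
    char_code = value & CHAR_CODE_MASK
    return leading_char, char_code


def join_value(leading_char, char_code):
    """Pack a leading char and a char code back into one integer."""
    return ((leading_char & LEADING_CHAR_MASK) << CHAR_CODE_BITS) | (
        char_code & CHAR_CODE_MASK
    )


# With leadingChar = 0 the value is just the char code.
plain = join_value(0, 0xE9)
# A non-zero leadingChar lives entirely above bit 22.
tagged = join_value(5, 0x4E2D)
# The largest packed value uses exactly 30 bits.
largest = join_value(0xFF, CHAR_CODE_MASK)
```

The largest packed value is 2^30 - 1, which is consistent with the guess above that the layout stops at bit 30 so that the value stays a SmallInteger.
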
>
> :)
> I read the comment of the class and some code and got lost
>
>> For leadingChar ~~ 0, I'm not even sure of the correct charCode
>> interpretation...
>>
>> For value < 256, the interpretation of the charCode is not exactly unicode...
>> It's more CP1252 (with assigned values to codes from 128 to 159).
>>
>> Once upon a time, it used to be Mac Roman encoding instead...
>> Let's forget the past (but you could see some remnants in old code).
>>
>> ------------------------------
>>
>> When marshalling/unmarshalling strings to/from outside world we
>> could/should use ByteArray...
>> Unwisely, we don't.
>> Instead, we reuse a String as storage for these codes.
>> As a result, you see all these squeakToUtf8, utf8ToSqueak etc...
>> That means that the contents of the String cannot be interpreted
>> outside of its context... Very very bad IMHO.
>> From this point of view, the String no longer has a self-contained
>> meaning, but is just a blob of codes (on 8 or 32 bits).
>> Fortunately, we mostly use these forms for temporary storage, but
>> even so, I don't like it.
>>
>> There are other alternatives like defining subclasses of String that
>> encapsulate their encodings and know how to be well behaved Strings,
>> not just context dependent blobs.
>> For example, you could just as well define a UTF8String.
>> VW went down this kind of path a long time ago (not sure for UTF-8 though).
>>
>> Well, I'm not sure whether I succeeded in explaining something at all
>> or just added confusion...
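
A small illustration of the "String as a blob of codes" problem described above, again in Python as a stand-in. The two helper names are hypothetical and are only analogues of the squeakToUtf8 / utf8ToSqueak idea, not the actual Squeak methods: the point is simply that encoded bytes stored in a string type look like text but are not.

```python
def utf8_as_string_blob(text):
    """Encode to UTF-8, then store each byte as one character (0..255)."""
    return text.encode("utf-8").decode("latin-1")


def string_blob_to_text(blob):
    """Reverse step; only correct if the blob really holds UTF-8 bytes."""
    return blob.encode("latin-1").decode("utf-8")


text = "\u00e9"                    # one character: e with acute accent

# Explicit byte container: nobody can mistake this for text.
as_bytes = text.encode("utf-8")

# String reused as storage: a perfectly valid string, but its two
# characters mean nothing unless you know it holds UTF-8 codes.
blob = utf8_as_string_blob(text)

# Only code that knows the context can recover the original text.
recovered = string_blob_to_text(blob)
```

Here `as_bytes` and `blob` carry the same two codes, but only the first one says so through its type, which is the argument for a ByteArray (or an encoding-aware String subclass) at the boundary.
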
>
> Don't worry.
> For the Seaside book I started to read the Unicode standard and its
> history. Now it would be good to know what to do, and do it :)
>


Ask Seaside folks, they certainly have some ideas.

>> Anyway, Unicode is not simple, because it attempts to represent
>> several centuries of typesetting conventions of different cultures...
>> So don't expect the code to be as simple as in the ASCII times.
>> It forces you to ask what is a character at all? Several glyphs exist
>> for the same character (upper and lower case for a latin example),
>> some characters can be decomposed as a base character and a
>> diacritical mark, etc...
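
The base character plus diacritical mark point above can be shown with Python's standard `unicodedata` module (used here only as a convenient stand-in; nothing equivalent is claimed to exist in the image):

```python
import unicodedata

composed = "\u00e9"                                   # single code point
decomposed = unicodedata.normalize("NFD", composed)   # base + combining mark

# The decomposed form is 'e' followed by COMBINING ACUTE ACCENT (U+0301).
parts = [unicodedata.name(ch) for ch in decomposed]

# Both forms denote the same character, yet they differ as code sequences,
# so a naive element-by-element comparison says they are different.
same_codes = composed == decomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
```

So even a question as basic as "are these two strings equal?" needs a normalization step once composed and decomposed forms can both occur.
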
>
> Yes I read that.
>>
>> Character rendering is even worse, with kerning, ligatures, anti
>> aliasing, hinting, etc...
>> Designing a font of good quality is a lot of work, especially if you
>> have to support Unicode!
>> If it's getting too complex and we don't get the task force to handle
>> it, we'd better hook OS primitives to measure/render.
>> I guess it is far beyond your original question, but that will arise
>> soon, because without good fonts and good rendering, Unicode support
>> is kind of void.
>
> Yes, this is all a question of a community not moving for 10 years (not
> only Squeak) while the world makes progress and, more importantly, gets
> more and more complex.
> So maybe relying on external libraries will become more and more
> important (which I do not like).

We have the Cuis alternative for simplicity.
Not sure we should waste time competing in areas where we can't win.
Seaside just takes advantage of web standards and browsers, and it's
the main commercial Smalltalk niche these days, isn't it?

Nicolas

>> Nicolas
>>
>> 2010/3/28 Stéphane Ducasse <[email protected]>:
>>> Hi
>>>
>>> I'm trying to remember the situation with the internal representation
>>> of strings in Pharo/Squeak, to revise
>>> http://book.seaside.st/book/in-action/serving-files/character-encodings/seaside-pharo
>>>
>>> I saw that in Pharo we have this NonASCIIMap. I do not remember what
>>> has been done in Pharo.
>>> Argh, memory leaks... Nicolas, do you remember the situation?
>>>
>>> In this context, what is the squeakToUTF8-related behavior?
>>> Is Squeak still using Latin-1, or is it in the midst of changing?
>>>
>>> Stef
>>>
>>>
>>> _______________________________________________
>>> Pharo-project mailing list
>>> [email protected]
>>> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project