You should ask the Sophie team; their knowledge is certainly far more
advanced than mine.

A String should be a SequenceableCollection of Characters.
Internally, for space/speed reasons, it rather stores a code
representing the value of each Character.
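
Roughly, evaluated in a workspace, something like this illustrates the
point (a sketch; the exact classes printed may vary a bit between
Squeak/Pharo versions):

  | s |
  s := 'abc'.
  (s at: 1) class.   "==> Character: accessing answers Characters"
  (s at: 1) value.   "==> 97: the code actually stored"
  s class.           "==> ByteString here, since every code fits in 8 bits"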

In a simple model, this value would be the Unicode code point...
In Squeak, only the lowest 22 bits of a Character value are used to
encode the character (#charCode).
Bits 23 to 30 encode a so-called #leadingChar.
I guess we stopped at bit 30 just to be sure the value still fits in a
SmallInteger.
Don't count on me to explain leadingChar, I can't...
For leadingChar ~~ 0, I'm not even sure of the correct charCode
interpretation...
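
If I read the bit layout correctly, something like this shows the split
(again only a sketch, relying on #charCode and #leadingChar as they
exist in a Squeak image of that era):

  | c |
  c := Character value: (1 bitShift: 22) + 16r41.
  c charCode.      "==> 16r41, the low 22 bits"
  c leadingChar.   "==> 1, the bits above them"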

For values < 256, the interpretation of the charCode is not exactly
Unicode...
It's closer to CP1252 (with values assigned to the codes from 128 to 159).
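
If I understand that correctly, it means for instance:

  (Character value: 16r80).   "meant as the Euro sign (U+20AC), as in
                               CP1252, not as the C1 control U+0080"
  (Character value: 16r9C).   "meant as the oe ligature (U+0153)"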

Once upon a time, it used to be Mac Roman encoding instead...
Let's forget the past (but you could still see some remnants in old code).

------------------------------

When marshalling/unmarshalling strings to/from the outside world we
could/should use ByteArray...
Unwisely, we don't.
Instead, we reuse a String as storage for these codes.
As a result, you see all these squeakToUtf8, utf8ToSqueak, etc...
That means that the contents of the String cannot be interpreted
outside of its context... Very very bad IMHO.
From this point of view, the String no longer has a self-contained
meaning, but is just a blob of codes (8 or 32 bits wide).
Fortunately, we mostly use these forms for temporary storage, but even
so, I don't like it.
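
A quick sketch of what I mean, using the selectors above (the size
assumes UTF-8 encodes the accented letter on two bytes):

  | encoded |
  encoded := 'héllo' squeakToUtf8.
  encoded size.                    "==> 6: five characters became six
                                    UTF-8 bytes, still held in a String"
  encoded utf8ToSqueak = 'héllo'.  "==> true, but only meaningful as long
                                    as you remember the blob is UTF-8"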

There are other alternatives, like defining subclasses of String that
encapsulate their encodings and know how to be well-behaved Strings,
not just context-dependent blobs.
For example, you could as well define a UTF8String (see the sketch below).
VW (VisualWorks) went down this kind of path a long time ago (not sure
about UTF-8 though).
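
A purely hypothetical sketch of that idea (no such class exists in the
image, the category name is invented, and the method bodies are left out):

  String variableByteSubclass: #UTF8String
      instanceVariableNames: ''
      classVariableNames: ''
      poolDictionaries: ''
      category: 'Hypothetical-Encodings'

  "#at:, #size, #do: ... would then decode the underlying UTF-8 bytes on
   the fly and answer real Characters, so instances would remain
   well-behaved Strings rather than context-dependent blobs."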

Well, I'm not sure whether I succeeded in explaining anything at all
or just added confusion...

Anyway, Unicode is not simple, because it attempts to represent
several centuries of typesetting conventions of different cultures...
So don't expect the code to be as simple as in the ASCII times.
It forces you to ask what a character is at all. Several glyphs exist
for the same character (upper and lower case, for a Latin example),
some characters can be decomposed into a base character and a
diacritical mark, etc...
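
The decomposition point in a nutshell (a sketch; I assume WideString is
around to hold the code point above 255):

  | precomposed decomposed |
  precomposed := String with: (Character value: 16rE9).   "é, one code point"
  decomposed := WideString with: (Character value: 16r65)
      with: (Character value: 16r301).                     "e + combining acute"
  precomposed = decomposed.   "==> false, although both denote the same
                               abstract character"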

Character rendering is even worse, with kerning, ligatures,
anti-aliasing, hinting, etc...
Designing a font of good quality is a lot of work, especially if you
have to support Unicode!
If it's getting too complex and we don't get a task force to handle
it, we'd better hook OS primitives to measure/render.
I guess this is far beyond your original question, but it will arise
soon, because without good fonts and good rendering, Unicode support
is kind of void.

Nicolas

2010/3/28 Stéphane Ducasse <[email protected]>:
> Hi
>
> I'm trying to remember the situation with the internal representation of
> String in Pharo/Squeak,
> to revise
> http://book.seaside.st/book/in-action/serving-files/character-encodings/seaside-pharo
>
> I saw that in Pharo we have this NonASCIIMap. I do not remember what has
> been done in Pharo.
> Argh, memory leaks.... Nicolas, do you remember the situation?
>
> In this context, what is the squeakToUTF8-related behavior?
> Is Squeak still using Latin-1, or in the midst of changing?
>
> Stef
>
>

