You should ask the Sophie team; their knowledge is certainly far more advanced than mine.

A String should be a SequenceableCollection of Character. Internally, for space/speed reasons, it rather stores a code representing the value of each Character. In a simple model, this value would be the Unicode code point...

In Squeak, only the lowest 22 bits of a Character value are used to encode the character (#charCode). Bits of rank 23 to 30 encode a so-called #leadingChar. I guess we stopped at bit 30 just to be sure to stay within SmallInteger values. Don't count on me to explain leadingChar, I can't... For leadingChar ~~ 0, I'm not even sure of the correct charCode interpretation...

For values < 256, the interpretation of the charCode is not exactly Unicode... It's more like CP1252 (with values assigned to the codes from 128 to 159). Once upon a time, it used to be Mac Roman encoding instead... Let's forget the past (but you could see some remnants in old code).

------------------------------

When marshalling/unmarshalling strings to/from the outside world we could/should use a ByteArray... Unwisely, we don't. Instead, we reuse a String as storage for these codes. As a result, you see all these squeakToUtf8, utf8ToSqueak etc... That means that the contents of the String cannot be interpreted outside of its context... Very very bad IMHO. From this point of view, the String no longer has a self-contained meaning, but is just a blob of codes (on 8 or 32 bits). Fortunately, we mostly use these forms for temporary storage, but even so, I don't like it.

There are other alternatives, like defining subclasses of String that encapsulate their encodings and know how to be well-behaved Strings, not just context-dependent blobs. For example, you could just as well define a UTF8String. VW went down this kind of path a long time ago (not sure about UTF-8 though).

Well, I'm not sure whether I succeeded in explaining something at all or just added confusion... Anyway, Unicode is not simple, because it attempts to represent several centuries of typesetting conventions of different cultures...
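
To make the two points above a bit more concrete, here is a minimal sketch in Python (not Squeak code). It assumes exactly the layout described in this mail, with the lowest 22 bits as charCode and the next 8 bits as leadingChar, and the function names are made up for illustration. The last lines show why a String used as a byte blob cannot be read without knowing its context: the same bytes give different text depending on the encoding you assume.

```python
# Sketch only: follows the bit layout described in this mail,
# it is not taken from the Squeak sources.
CHAR_CODE_BITS = 22                          # lowest 22 bits: #charCode
CHAR_CODE_MASK = (1 << CHAR_CODE_BITS) - 1
LEADING_CHAR_MASK = 0xFF                     # bits of rank 23 to 30: #leadingChar


def split_character_value(value):
    """Return (leading_char, char_code) for a Character value."""
    char_code = value & CHAR_CODE_MASK
    leading_char = (value >> CHAR_CODE_BITS) & LEADING_CHAR_MASK
    return leading_char, char_code


def make_character_value(leading_char, char_code):
    """Pack a leadingChar and a charCode back into a single value."""
    return ((leading_char & LEADING_CHAR_MASK) << CHAR_CODE_BITS) | (
        char_code & CHAR_CODE_MASK
    )


# The same bytes read under two different assumptions about their encoding.
utf8_bytes = "é".encode("utf-8")             # two bytes: 0xC3 0xA9
read_as_utf8 = utf8_bytes.decode("utf-8")    # the original character
read_as_cp1252 = utf8_bytes.decode("cp1252") # two unrelated characters
```

With leadingChar = 0 the value is just the charCode, which is the only case I would dare to interpret.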
So don't expect the code to be as simple as in the ASCII times. It forces you to ask: what is a character at all? Several glyphs exist for the same character (upper and lower case, for a Latin example), some characters can be decomposed into a base character and a diacritical mark, etc...

Character rendering is even worse, with kerning, ligatures, anti-aliasing, hinting, etc... Designing a good-quality font is a lot of work, especially if you have to support Unicode! If it's getting too complex and we don't have the workforce to handle it, we'd better hook into OS primitives to measure/render.

I guess this is far beyond your original question, but it will come up soon, because without good fonts and good rendering, Unicode support is kind of void.

Nicolas

2010/3/28 Stéphane Ducasse <[email protected]>:
> Hi
>
> I'm trying to remember the situation with the internal representation of
> strings in pharo/squeak
> to revise
> http://book.seaside.st/book/in-action/serving-files/character-encodings/seaside-pharo
>
> I saw that in pharo we have this NonASCIIMap. I do not remember what has
> been done in pharo.
> Argh memory leaks.... Nicolas do you remember the situation?
>
> In this context what is the squeakToUTF8 related behavior?
> is squeak still using latin-1 or in the midst of changing?
>
> Stef
>
>
> _______________________________________________
> Pharo-project mailing list
> [email protected]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
