Re: [Pharo-project] squeakToUTF-8 and related?

Philippe Marschall Tue, 30 Mar 2010 09:34:18 -0700

Nicolas Cellier wrote:
> 2010/3/28 Stéphane Ducasse <[email protected]>:
>>> You should ask the Sophie team; their knowledge is certainly far more
>>> advanced than mine.
>> The problem is that most of them disappeared after the Java rewrite announcement.
>>
>>> String should be a SequenceableCollection of Character.
>>> Internally, for space/speed reasons, they instead store a code
>>> representing the value of a Character.
>>>
>>> In a simple model, this value would be the Unicode encoding...
>>> In Squeak, only the lowest 22 bits of a Character value are used to encode
>>> the character (#charCode).
>>> Bits of rank 23 to 30 encode a so-called #leadingChar.
>>> I guess we stopped at bit #30 just to be sure to handle SmallInteger values.
>>> Don't count on me to explain leadingChar, I can't...
>> :)
>> I read the comment of the class and some code and got lost
>>
>>> For leadingChar ~~ 0, I'm not even sure of the correct charCode
>>> interpretation...
>>>
>>> For value < 256, the interpretation of the charCode is not exactly
>>> Unicode...
>>> It's more CP1252 (with assigned values to codes from 128 to 159).
>>>
>>> Once upon a time, it used to be Mac Roman encoding instead...
>>> Let's forget the past (but you could see some remnants in old code).
>>>
>>> ------------------------------
>>>
>>> When marshalling/unmarshalling strings to/from outside world we
>>> could/should use ByteArray...
>>> Unwisely, we don't.
>>> Instead, we reuse a String as storage for these codes.
>>> As a result, you see all these squeakToUtf8, utf8ToSqueak etc...
>>> That means that the contents of the String cannot be interpreted
>>> outside of its context... Very very bad IMHO.
>>> From this point of view, the String no longer has a self-contained
>>> meaning, but is just a blob of codes (on 8 or 32 bits).
>>> Fortunately, we mostly use these forms for temporary storage, but
>>> even so, I don't like it.
>>>
>>> There are other alternatives like defining subclasses of String that
>>> encapsulate their encodings and know how to be well behaved Strings,
>>> not just context dependent blobs.
>>> For example, you could just as well define a UTF8String.
>>> VW went down this kind of path a long time ago (not sure about UTF-8 though).
>>>
>>> Well, I'm not sure whether I succeeded in explaining something at all
>>> or just added confusion...
>> Don't worry.
>> For the Seaside book I started to read the Unicode standard and its
>> history. Now it would be good to know what to do and do it :)
>>
> 
> Ask Seaside folks, they certainly have some ideas.


I don't like the #squeakToXXX methods because every encoding you
support needs its own method. That's why I prefer the
#convertToEncoding: method: one method covers every encoding.
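
To illustrate the difference, here is a minimal Python sketch (not Squeak
code). The function name convert_to_encoding is a hypothetical analogue of
#convertToEncoding:, and it relies on Python's built-in str.encode, which
returns bytes; the Smalltalk method may return something else. The point is
only that the encoding is a parameter, so no new method is needed per
encoding.

```python
def convert_to_encoding(text: str, encoding_name: str) -> bytes:
    """Hypothetical analogue of #convertToEncoding: (illustration only).

    The encoding is passed by name, so supporting another encoding
    means passing another name, not writing another method.
    """
    return text.encode(encoding_name)


# One entry point serves several encodings.
samples = {
    name: convert_to_encoding("\u00e9", name)  # U+00E9, e with acute accent
    for name in ("utf-8", "latin-1", "utf-16-be")
}

for name, encoded in samples.items():
    print(name, encoded)
```

With the #squeakToXXX style, each of the three encodings above would need
its own selector.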

I like the #xxxToSqueak methods even less; the #convertFromEncoding:
method does the same job. In addition, you are dealing with strings
that are not in the native Squeak format. If you pass them anywhere,
you're unlikely to get the expected result. I prefer ByteArrays for this
use case: they carry no semantics and make it clear that the data is not
a native string.
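
A small Python sketch of the problem, under stated assumptions: the helper
as_blob_string is hypothetical and imitates "encoded bytes stored in a
string" by mapping each byte to the character with the same code. It is not
how Squeak implements anything; it only shows why such a string misbehaves
once it leaves its context, and why a byte container is safer.

```python
def as_blob_string(data: bytes) -> str:
    """Hypothetical helper: store each byte as a character of the same code.

    This imitates keeping encoded data in a string instead of a byte array.
    """
    return data.decode("latin-1")


native = "\u00e9"                      # one character, e with acute accent
utf8_bytes = native.encode("utf-8")    # two bytes: C3 A9
blob = as_blob_string(utf8_bytes)      # a "string" that is really a byte blob

# The blob looks like a string but no longer means the same thing.
print(len(native), len(blob))          # the lengths differ

# Code that treats the blob as text and encodes it again corrupts the data.
double_encoded = blob.encode("utf-8")
print(double_encoded)

# Bytes cannot be mixed with text by accident: the type makes it explicit.
try:
    utf8_bytes + "x"
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

The blob string passes every "is this a string?" check, so nothing stops it
from being encoded a second time. The bytes object fails loudly instead.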

Cheers
Philippe


_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
