Re: [Pharo-project] squeakToUTF-8 and related?

Nicolas Cellier Mon, 29 Mar 2010 05:00:36 -0700

2010/3/29 Henrik Johansen <[email protected]>:
>
> On Mar 29, 2010, at 11:52 43AM, Nicolas Cellier wrote:
>
>> 2010/3/29 Henrik Johansen <[email protected]>:
>>>
>>> On Mar 29, 2010, at 11:16 30AM, Nicolas Cellier wrote:
>>>
>>>> I presume that under the idiom "latin1" you refer to code page 1252
>>>> rather than iso8859-L1, right ?
>>>>
>>>> Nicolas
>>> Good question :)
>>> What IS the presumed internal encoding of Bytestrings in Squeak?
>>> That's the one I meant, I merely assumed it was latin1 seeing as how the 
>>> text converter refers to it as such.
>>> Personally I thought it was iso8859-L1, seeing as the bytestring to unicode 
>>> conversion does a simple shift of chars > 127 to the 0080 - 00FF range.
>>>
>>> Cheers,
>>> Henry
>>>
>>
>> From what I understood, CP1252 is Microsoft's "latin1" and uses codes 128 to 
>> 159.
>> ISO8859-L1 matches the first 256 codes of Unicode (Latin-1) and leaves codes 128
>> to 159 unused.
>> You know, when Microsoft "uses" a standard, it's always a better standard ;)
>>
>> I have nothing against CP1252, it's an optimization which avoids
>> wasting 32 cheap codes.
>> But I'm not sure about various compatibility issues in/with the
>> external world...
>>
>> Squeak clearly uses CP1252.
>> For Pharo, there might be a mix of the two since the Sophie-like
>> refactorings. Surely that is what John was referring to.
>>
>> Nicolas
>
> Ummm...
> All the utf8-converters in squeak use Unicode value:, which maps directly 
> from charCode 128->255 to Unicode value 128->255.
> Unicode value 128->255 IS iso8859-L1, so if squeak uses CP1252 as internal 
> format, all the converters in Squeak are wrong.
>
> Cheers,
> Henry
>


ISO 8859-1 and CP1252 only differ for code points 16r80 to 16r9F.
Contrary to what I said, in ISO 8859-1 these code points are assigned to
C1 control characters (anyone ever used these ?).
See http://en.wikipedia.org/wiki/ISO_8859-1 and
http://en.wikipedia.org/wiki/Windows-1252
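
To make the difference concrete, here is a small workspace sketch. It is
only an illustration, not code from the image: the cp1252 dictionary is a
hypothetical helper holding just four of the CP1252 assignments in the
16r80 to 16r9F range, and the ISO 8859-1 column assumes the identity
mapping that Henry describes for Unicode value:.

```smalltalk
"Hypothetical helper table: a few CP1252 assignments that differ from
 ISO 8859-1. Bytes missing from the table fall back to the identity mapping."
| cp1252 |
cp1252 := Dictionary new.
cp1252 at: 16r80 put: 16r20AC.   "euro sign"
cp1252 at: 16r85 put: 16r2026.   "horizontal ellipsis"
cp1252 at: 16r92 put: 16r2019.   "right single quotation mark"
cp1252 at: 16r99 put: 16r2122.   "trade mark sign"

"Print, for each sample byte, the Unicode code point under each reading.
 16rE9 is included as a byte on which both encodings agree."
#(16r80 16r85 16r92 16r99 16rE9) do: [:byte |
    Transcript
        show: 'byte 16r', (byte printStringBase: 16);
        show: '  ISO 8859-1 -> U+', (byte printStringBase: 16);
        show: '  CP1252 -> U+', ((cp1252 at: byte ifAbsent: [byte]) printStringBase: 16);
        cr]
```

If the converters map bytes straight to the same Unicode value, only the
ISO 8859-1 column can come out of them, whatever the fonts happen to draw.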

Now, I'm not so sure anymore why I thought Squeak was CP1252. Is it ?
My guess was probably based on the macToSqueak and squeakToMac implementations.
But the rendering of the following snippet isn't CP1252-compliant:

String withAll: ((16r80 to: 16r9F) collect: [:e | Character value: e])
or
(16r80 to: 16r9F) collect: [:e | Character value: e] as: String
''

In Squeak 4.1 the different fonts don't agree on how to render these characters...
DefaultFixedTextStyle is still using MacRoman and displays accented characters.
DefaultTextStyle hacks the first 4 entries with caret, underscore, left arrow
and up arrow (probably a Cuis hack).
Accu* just seems to have a hack for left arrow.
Maybe with a bit more clean-up (Character euro is answering the
MacRoman code for example, and taking the MacRoman conversions from
Sophie/Pharo), we could declare that Squeak is using Unicode...
Great !

Nicolas


>
> _______________________________________________
> Pharo-project mailing list
> [email protected]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
>
