Re: [Pharo-project] squeakToUTF-8 and related?

Henrik Johansen Mon, 29 Mar 2010 06:54:36 -0700

On Mar 29, 2010, at 2:00 09PM, Nicolas Cellier wrote:

> 2010/3/29 Henrik Johansen <[email protected]>:
>> 
>> On Mar 29, 2010, at 11:52 43AM, Nicolas Cellier wrote:
>> 
>>> 2010/3/29 Henrik Johansen <[email protected]>:
>>>> 
>>>> On Mar 29, 2010, at 11:16 30AM, Nicolas Cellier wrote:
>>>> 
>>>>> I presume that under the idiom "latin1" you refer to code page 1252
>>>>> rather than iso8859-L1, right ?
>>>>> 
>>>>> Nicolas
>>>> Good question :)
>>>> What IS the presumed internal encoding of Bytestrings in Squeak?
>>>> That's the one I meant, I merely assumed it was latin1 seeing as how the 
>>>> text converter refers to it as such.
>>>> Personally I thought it was iso8859-L1, seeing as the bytestring to 
>>>> unicode conversion does a simple shift of chars > 127 to the 0080 - 00FF 
>>>> range.
>>>> 
>>>> Cheers,
>>>> Henry
>>>> 
>>> 
>>> From what I understood, CP1252 is Microsoft "latin1" and uses codes 128 to 
>>> 159.
>>> ISO8859-L1 matches the first 256 codes of unicode latin-1 and has codes 128
>>> to 159 unused.
>>> You know, when Microsoft "uses" a standard, it's always a better standard ;)
>>> 
>>> I have nothing against CP1252, it's an optimization which avoids
>>> wasting 32 cheap codes.
>>> But I'm not sure about various compatibility issues in/with the
>>> external world...
>>> 
>>> Squeak clearly uses CP1252.
>>> For Pharo, there might be a mix of the two since Sophie-like
>>> refactorings. Surely what John was referring to.
>>> 
>>> Nicolas
>> 
>> Ummm...
>> All the utf8-converters in squeak use Unicode value:, which maps directly 
>> from charCode 128->255 to Unicode value 128->255.
>> Unicode value 128->255 IS iso8859-L1, so if squeak uses CP1252 as internal 
>> format, all the converters in Squeak are wrong.
>> 
>> Cheers,
>> Henry
>> 
> 
> ISO8859-L1 and CP1252 only differ for code points 16r80 to 16r9F.
> Contrary to what I said, in ISO8859-L1 these code points are assigned
> to C1 control characters (anyone ever used these ?).
> See http://en.wikipedia.org/wiki/ISO_8859-1 and
> http://en.wikipedia.org/wiki/Windows-1252


Not to my knowledge :) 
The strong argument for using latin1 as internal charset for ByteString vs 1252 
is the 1-1 mapping to unicode values.
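
As a rough illustration of that 1-1 mapping (a sketch in Python rather than Smalltalk, relying only on Python's standard "latin-1" and "cp1252" codecs, not on any Squeak or Pharo converter), the following checks that every latin1 byte decodes to the Unicode code point of the same value, and lists the byte values where CP1252 decodes differently:

```python
# ISO 8859-1: every byte value equals the Unicode code point it decodes to.
all_bytes = bytes(range(256))
latin1_is_identity = all(
    ord(ch) == b for b, ch in zip(all_bytes, all_bytes.decode("latin-1"))
)


def differing_bytes():
    """Byte values whose CP1252 decoding differs from their ISO 8859-1 decoding."""
    diffs = []
    for b in range(256):
        as_latin1 = bytes([b]).decode("latin-1")
        # Python's cp1252 codec leaves a few bytes undefined;
        # errors="replace" turns those into U+FFFD instead of raising.
        as_cp1252 = bytes([b]).decode("cp1252", errors="replace")
        if as_latin1 != as_cp1252:
            diffs.append(b)
    return diffs


diffs = differing_bytes()
print(latin1_is_identity)
print(hex(min(diffs)), hex(max(diffs)), len(diffs))
```

The differing bytes are exactly the 16r80 to 16r9F block discussed above; everything else is shared between the two encodings.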

> 
> Now, I'm not so sure anymore why I thought squeak was CP1252. Is it ?
Seems ambiguous.

> My guess was probably based on macToSqueak and squeakToMac implementation.

Yes, that does indeed do MacRoman -> 1252 transformation. As does 
MacRomanTextConverter, in Pharo as well...
Converters assuming different internal encodings, fonts which render a charset 
different from both of them... Fun eh?

> But the rendering of the following snippet isn't CP1252 compliant:
> 
> String withAll: ((16r80 to: 16r9F) collect: [:e | Character value: e])
> or
> (16r80 to: 16r9F) collect: [:e | Character value: e] as: String
> '•™≠∞≥∑∫Ω√≈…—‘Ÿ⁄∂∆Œ‚„‰ˆ˜˘˙˚˝˛ˇıƒ'
> 
> In Squeak 4.1 the different fonts don't agree on rendering these characters...
> DefaultFixedTextStyle is still using MacRoman and display accented characters.
> DefaultTextStyle hack first 4 entries with caret underscore left arrow
Yup, Bitmap DejaVu is latin15 (some characters different from latin1, amongst 
them the € ), with 4 extra entries as you mentioned.
> and up arrow (probably a Cuis hack)
> Accu* just seem to have a hack for left arrow
Yeah, they seem to cover... a blend of latin1, latin15 (has euro symbol), and 
something else (square-root :D ). Wee.

Render with a Unicode font, and you get nothing but []'s, which would be the 
correct latin1-rendering of said string.

Which is why I said an encoding property for the StrikeFonts was needed, so you 
can do the proper conversion of internal string charcodes to the charcode 
values the font expects. (Or rather, bitmap offsets)
This of course means you'd have to come up with a consistent definition of 
what the internal ByteString encoding in Squeak is first, though.
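
To make the idea of a per-font encoding property concrete, here is a minimal sketch in Python (not Squeak code). The helper name glyph_index is hypothetical, and it assumes a bitmap font whose glyph table is laid out in the order of a named single-byte encoding, so that the glyph offset is simply the byte the character encodes to:

```python
def glyph_index(char, font_encoding):
    """Return the offset a bitmap font laid out in font_encoding would use
    for char, or None if that encoding has no single-byte slot for it.

    Hypothetical helper: assumes the font's glyph table follows the byte
    order of the named encoding.
    """
    try:
        encoded = char.encode(font_encoding)
    except UnicodeEncodeError:
        return None
    return encoded[0] if len(encoded) == 1 else None


# The euro sign sits at a different offset in each candidate font encoding.
for enc in ("latin-1", "iso8859-15", "cp1252", "mac-roman"):
    print(enc, glyph_index("\u20ac", enc))
```

With a property like this on each font, the string itself can keep one internal encoding, and the translation to glyph offsets happens only at rendering time.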


> Maybe with a bit more clean-up (Character euro is answering the
> MacRoman code for example,
The keyboard input handling in Squeak does strange things, at least on a Mac...
Alt - § (which gives a euro symbol on my keyboard layout) is read as a WideChar 
with the correct unicode value on Pharo, but as Char 164 in Squeak.
Alt- 5 (∞) does a similar thing, reads as correct widechar on Pharo, but on 
Squeak turns into char 129.
> and taking macRoman conversions from
> Sophie/Pharo), we could declare Squeak is using unicode...
> Great !
> 
> Nicolas


That would be my dream as well. 
Or really, I'd settle for any unambiguous definition of what the ByteString 
encoding is.
"A little more clean-up" may or may not be an understatement though, it would 
involve going through all the converters, all keyboard-input processing code 
(seems to be more stable in Pharo on mac), and all places where strings 
enter/leave the system. :)

Cheers,
Henry


_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
