Re: [Pharo-project] Unicode/Latin1 handling in Pharo 1.3

Norbert Hartl Tue, 08 May 2012 05:13:07 -0700

On 08.05.2012 at 14:00, Nicolas Cellier wrote:

> leadingChar was here merely for handling Han unification
> http://en.wikipedia.org/wiki/Han_unification.
> As I understand it, the language variant is encoded in leadingChar,
> while the generic ideogram is encoded in the unicode value.
> We tried to clean a lot already and avoid using leadingChar, except
> maybe for east-asian languages, so in case of umlaut, I would classify
> this as a bug.
> 
OK, so there is a real intention to get rid of leadingChar. Would it then be
good to have Character>>#= use charCode instead of asciiValue?
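
To make the question concrete, here is a minimal workspace sketch of the
arithmetic behind the two values quoted below. It assumes that leadingChar is
packed above the low 22 bits of the character value and that it is 17 for
ISO-8859-15. Both are inferred from 71303420 - 252 = 17 * 2^22, not read from
the Pharo 1.3 sources.

```smalltalk
"Workspace sketch, plain integers only.
 Assumption: value = (leadingChar bitShift: 22) + code point.
 Assumption: leadingChar is 17 for the ISO-8859-15 converter."
| plain tagged |
plain := 252.                           "what Latin1TextConverter yields for the umlaut"
tagged := (17 bitShift: 22) + 252.      "what ISO885915TextConverter yields"
tagged = 71303420.                      "true"
plain = tagged.                         "false: this is what comparing asciiValue sees"
(tagged bitAnd: 16r3FFFFF) = plain.     "true: this is what comparing the low 22 bits sees"
```

If that layout holds, comparing charCode (assumed here to mask off the
leadingChar bits) would make the two converter results equal. Character>>#hash
would then have to ignore leadingChar as well, so that equal characters keep
equal hashes.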


Norbert

> 2012/5/8 Norbert Hartl:
>> 
>> On 08.05.2012 at 11:41, Sven Van Caekenberghe wrote:
>> 
>>> Hi Holger,
>>> 
>>> --
>>> Sven Van Caekenberghe
>>> http://stfx.eu
>>> Smalltalk is the Red Pill
>>> 
>>> 
>>> 
>>> 
>>> On 08 May 2012, at 10:57, Holger Hans Peter Freyther wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I am trying to read a file from disk that is in latin1 encoding and then
>>>> try to compare the string with a string provided as a literal, and it
>>>> fails. My test case can be seen below. I am using Pharo 1.3 on a Linux
>>>> machine with a UTF-8 locale. Is there something obvious that I am doing
>>>> wrong?
>>>> 
>>>> 
>>>> | stream text |
>>>> stream := (FileStream fileNamed: 'pharo_example_latin1.txt')
>>>>                      converter: ISO885915TextConverter new;
>>>>                      yourself.
>>>> text := stream contents.
>>>> text = 'Teilrückzahlung'
>>>> 
>>>> <pharo_example_latin1.txt>
>>> 
>>> This should work:
>>> 
>>> | stream text |
>>> stream := (FileStream fileNamed: 
>>> '/Users/sven/Desktop/pharo_example_latin1.txt')
>>>                       converter: Latin1TextConverter  new;
>>>                       yourself.
>>> text := stream contents.
>>> text = 'Teilrückzahlung'.
>>> 
>>> or this (latest Zn code):
>>> 
>>> | stream text |
>>> stream := (FileStream fileNamed: 
>>> '/Users/sven/Desktop/pharo_example_latin1.txt')
>>>                       binary;
>>>                       yourself.
>>> text := (ZnCharacterEncoder newForEncoding: 'iso-8859-15')
>>>               decodeBytes: stream contents.
>>> text = 'Teilrückzahlung'.
>>> 
>>> There seems to be an issue with ISO885915TextConverter. Check the umlaut
>>> encoding: it adds something called the leadingChar, which I don't think is
>>> needed (I don't know why it even exists).
>>> 
>> Yes, characters are mapped to characters with a leadingChar. Furthermore, in
>> Character the leadingChar value alters the output of asInteger, asciiValue,
>> codePoint... because they all use the value of the character, which includes
>> the leadingChar. But Latin1TextConverter doesn't do that, so you get false for
>> 
>> (ISO885915TextConverter new byteToUnicode: $ü) = (Latin1TextConverter new 
>> byteToUnicode: $ü)
>> 
>> Because
>> 
>> Character>>#= aCharacter
>>        "Primitive. Answer true if the receiver and the argument are the same
>>        object (have the same object pointer) and false otherwise. Optional.
>>        See Object documentation whatIsAPrimitive."
>> 
>>        ^ self == aCharacter or:[
>>                aCharacter isCharacter and: [self asciiValue = aCharacter asciiValue]]
>> 
>> uses asciiValue, which is 252 in one case and 71303420 (252 + leadingChar)
>> in the other.
>> 
>> So leadingChar can be considered harmful. It surely was useful in the 8-bit
>> days (that followed the 7-bit days). But I personally cannot see why the
>> encoding of a character should persist once the character has been created.
>> The approach with leadingChar keeps 128 <= asciiValue <= 255 and adds the
>> leadingChar for the encoding (comparable to code pages). Another approach
>> would be to have a homogeneous encoding inside the image, so that the value
>> of a character is always Unicode (the 8-bit trouble is solved because 8-bit
>> is now Latin-1). There would only be Unicode characters without an encoding,
>> and encoders that produce the individual encodings.
>> The "Westerners first" approach then leads to a situation where everyone not
>> living in the Latin-1 zone (or good ol' 7-bit US) has to live with wide
>> strings.
>> 
>> This should be changed, but I can see that it is no easy task because the
>> leadingChar is used throughout the system to help with things like
>> collation. It would be a bigger effort to clean this up.
>> 
>> Norbert
>> 
>> 
>> 
>
