Re: [Pharo-project] Unicode/Latin1 handling in Pharo 1.3

Nicolas Cellier Tue, 08 May 2012 05:00:53 -0700

leadingChar was there only to handle Han unification
(http://en.wikipedia.org/wiki/Han_unification).
As I understand it, the language variant is encoded in leadingChar,
while the generic ideogram is encoded in the Unicode value.
We have already cleaned up a lot and try to avoid using leadingChar,
except perhaps for East Asian languages, so in the case of an umlaut
I would classify this as a bug.
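
To make the two numbers Norbert quotes below easier to relate, here is a
small workspace sketch. It assumes that the character value keeps the code
point in the low 22 bits and leadingChar in the bits above them; that is my
reading of the Character implementation, so treat the shift width as an
assumption rather than a documented fact.

```smalltalk
"Sketch only: assumes code point in the low 22 bits, leadingChar above them."
| packed leading code |
packed := 71303420.                 "value reported for ü from ISO885915TextConverter"
leading := packed bitShift: -22.    "17, the leadingChar part"
code := packed bitAnd: 16r3FFFFF.   "252, the Latin-1/Unicode code point of ü"
(leading bitShift: 22) + code = packed.   "true"
```

Under that assumption the two converters agree on the code point (252) and
differ only in the leadingChar bits, which is why the equality test fails.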


Nicolas

2012/5/8 Norbert Hartl:
>
> Am 08.05.2012 um 11:41 schrieb Sven Van Caekenberghe:
>
>> Hi Holger,
>>
>> --
>> Sven Van Caekenberghe
>> http://stfx.eu
>> Smalltalk is the Red Pill
>>
>>
>>
>>
>> On 08 May 2012, at 10:57, Holger Hans Peter Freyther wrote:
>>
>>> Hi,
>>>
>>> I am trying to read a file from disk that is in latin1 encoding and then
>>> compare the string with a string provided as a literal, and it fails. My
>>> test case can be seen below. I am using Pharo 1.3 on a Linux machine with
>>> a UTF-8 locale. Is there something obvious that I am doing wrong?
>>>
>>>
>>> | stream text |
>>> stream := (FileStream fileNamed: 'pharo_example_latin1.txt')
>>>                      converter: ISO885915TextConverter new;
>>>                      yourself.
>>> text := stream contents.
>>> text = 'Teilrückzahlung'
>>>
>>> <pharo_example_latin1.txt>
>>
>> This should work:
>>
>> | stream text |
>> stream := (FileStream fileNamed: '/Users/sven/Desktop/pharo_example_latin1.txt')
>>                       converter: Latin1TextConverter new;
>>                       yourself.
>> text := stream contents.
>> text = 'Teilrückzahlung'.
>>
>> or this (latest Zn code):
>>
>> | stream text |
>> stream := (FileStream fileNamed: '/Users/sven/Desktop/pharo_example_latin1.txt')
>>                       binary;
>>                       yourself.
>> text := (ZnCharacterEncoder newForEncoding: 'iso-8859-15')
>>               decodeBytes: stream contents.
>> text = 'Teilrückzahlung'.
>>
>> There seems to be an issue with ISO885915TextConverter. Check the umlaut
>> encoding: it adds something called the leadingChar, which I don't think is
>> needed (I don't know why it even exists).
>>
> Yes, characters are mapped to characters with a leadingChar. Furthermore, in
> Character the leadingChar value alters the output of asInteger, asciiValue,
> codePoint... because they all use the value of the character, which includes
> the leadingChar. But Latin1TextConverter doesn't do that, so you get false for
>
> (ISO885915TextConverter new byteToUnicode: $ü) = (Latin1TextConverter new 
> byteToUnicode: $ü)
>
> Because
>
> Character>>#= aCharacter
>        "Primitive. Answer true if the receiver and the argument are the same
>        object (have the same object pointer) and false otherwise. Optional.
>        See Object documentation whatIsAPrimitive."
>
>        ^ self == aCharacter or:[
>                aCharacter isCharacter and: [self asciiValue = aCharacter asciiValue]]
>
> is using asciiValue containing in one case 252 and in the other 71303420 (252 
> + leadingChar)
>
> So leadingChar can be considered harmful. It surely was useful in the 8bit
> days (that followed the 7bit days). But I personally cannot see why the
> encoding of a character should persist once the character has been created.
> The approach with leadingChar is to keep 128 <= asciiValue <= 255 and add the
> leadingChar for the encoding (comparable to code pages). Another approach
> would be to have a homogeneous encoding inside the image, so that the value
> of a character is always Unicode (the 8bit trouble is solved because 8bit now
> is latin1). There would only be Unicode characters without encoding, and
> encoders that produce individual encodings.
> The "westerners first" approach then leads to a situation where everyone not
> living in the latin1 zone (or good ol' 7bit US) has to live with wide strings.
>
> This should be changed, but I can see that it is no easy task because the
> leadingChar is used throughout the system to help with things like collation.
> It would be a bigger effort to clean this up.
>
> Norbert
>
>
>
