On 08.05.2012 at 11:41, Sven Van Caekenberghe wrote:

> Hi Holger,
> 
> --
> Sven Van Caekenberghe
> http://stfx.eu
> Smalltalk is the Red Pill
> 
> 
> 
> 
> On 08 May 2012, at 10:57, Holger Hans Peter Freyther wrote:
> 
>> Hi,
>> 
>> I am trying to read a file from disk that is in latin1 encoding and then try
>> to compare the string with a string provided as literal and it fails. My test
>> case can be seen below. I am using Pharo 1.3 on a Linux machine with a UTF-8
>> locale. Is there something obvious that I am doing wrong?
>> 
>> 
>> | stream text |
>> stream := (FileStream fileNamed: 'pharo_example_latin1.txt')
>>                      converter: ISO885915TextConverter new;
>>                      yourself.
>> text := stream contents.
>> text = 'Teilrückzahlung'
>>                      
>> <pharo_example_latin1.txt>
> 
> This should work:
> 
> | stream text |
> stream := (FileStream fileNamed: '/Users/sven/Desktop/pharo_example_latin1.txt')
>                       converter: Latin1TextConverter new;
>                       yourself.
> text := stream contents.
> text = 'Teilrückzahlung'.
> 
> or this (latest Zn code):
> 
> | stream text |
> stream := (FileStream fileNamed: '/Users/sven/Desktop/pharo_example_latin1.txt')
>                       binary;
>                       yourself.
> text := (ZnCharacterEncoder newForEncoding: 'iso-8859-15')
>               decodeBytes: stream contents.
> text = 'Teilrückzahlung'.
> 
> There seems to be an issue with ISO885915TextConverter. Check the umlaut
> encoding: it adds something called the leadingChar, which I don't think is
> needed (I don't know why it even exists).
> 
Yes, characters are mapped to characters with a leadingChar. Furthermore, in
Character the leadingChar alters the result of asInteger, asciiValue and
codePoint, because they all answer the full value of the character, which
includes the leadingChar. Latin1TextConverter does not add one, so you get
false for

(ISO885915TextConverter new byteToUnicode: $ü)
        = (Latin1TextConverter new byteToUnicode: $ü)

Because 

Character>>#= aCharacter 
        "Primitive. Answer true if the receiver and the argument are the same
        object (have the same object pointer) and false otherwise. Optional. See
        Object documentation whatIsAPrimitive."

        ^ self == aCharacter or:[
                aCharacter isCharacter
                        and: [self asciiValue = aCharacter asciiValue]]

uses asciiValue, which answers 252 in one case and 71303420 (252 plus the
leadingChar bits) in the other.
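
The two numbers can be taken apart with plain integer arithmetic. This is only
a sketch: it assumes the Squeak-style layout in which the low 22 bits of a
Character's value hold the code point and the bits above hold the leadingChar.
The variable names are made up for illustration.

```smalltalk
| plain tagged |
plain := 252.                   "value seen via Latin1TextConverter"
tagged := 71303420.             "value seen via ISO885915TextConverter"
tagged - plain.                 "71303168, the leadingChar contribution"
tagged bitShift: -22.           "17, the leadingChar under the assumed layout"
tagged bitAnd: 16r3FFFFF.       "252, the same code point as plain"
```

Under that assumption both characters carry the same code point, and only the
tag in the upper bits makes = answer false.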

So leadingChar can be considered harmful. It surely was useful in the 8bit days
(that followed the 7bit days). But I personally cannot see why the encoding
should persist once a character has been created. The leadingChar approach
keeps 128 <= asciiValue <= 255 and adds the leadingChar to identify the
encoding (comparable to code pages). Another approach would be a homogeneous
encoding inside the image, so that the value of a character is always its
Unicode code point (the 8bit trouble is solved because 8bit now is latin1).
There would only be Unicode characters without an encoding tag, plus encoders
that produce the individual encodings.
The "westerners first" approach then leads to a situation where everyone not
living in the latin1 zone (or good ol' 7bit US) has to live with wide strings.

This should be changed, but I can see that it is no easy task, because
leadingChar is used throughout the system to help with things like collation.
Cleaning this up would be a bigger effort.

Norbert


