On 08.05.2012 at 11:41, Sven Van Caekenberghe wrote:
> Hi Holger,
>
> --
> Sven Van Caekenberghe
> http://stfx.eu
> Smalltalk is the Red Pill
>
> On 08 May 2012, at 10:57, Holger Hans Peter Freyther wrote:
>
>> Hi,
>>
>> I am trying to read a file from disk that is in latin1 encoding and then try
>> to compare the string with a string provided as literal and it fails. My test
>> case can be seen below. I am using Pharo 1.3 on a Linux machine with a UTF-8
>> locale. Is there something obvious that I am doing wrong?
>>
>>
>> | stream text |
>> stream := (FileStream fileNamed: 'pharo_example_latin1.txt')
>> converter: ISO885915TextConverter new;
>> yourself.
>> text := stream contents.
>> text = 'Teilrückzahlung'
>>
>> <pharo_example_latin1.txt>
>
> This should work:
>
> | stream text |
> stream := (FileStream fileNamed:
> '/Users/sven/Desktop/pharo_example_latin1.txt')
> converter: Latin1TextConverter new;
> yourself.
> text := stream contents.
> text = 'Teilrückzahlung'.
>
> or this (latest Zn code):
>
> | stream text |
> stream := (FileStream fileNamed:
> '/Users/sven/Desktop/pharo_example_latin1.txt')
> binary;
> yourself.
> text := (ZnCharacterEncoder newForEncoding: 'iso-8859-15')
> decodeBytes: stream contents.
> text = 'Teilrückzahlung'.
>
> There seems to be an issue with ISO885915TextConverter. Check the umlaut
> encoding: it adds something called the leadingChar, which I don't think is
> needed (I don't know why it even exists).
>
Yes, characters are mapped to characters with a leadingChar. Furthermore, in
Character the leadingChar alters the result of asInteger, asciiValue,
codePoint... because they all answer the value of the character, which
includes the leadingChar. Latin1TextConverter doesn't add one, so you get
false for

    (ISO885915TextConverter new byteToUnicode: $ü) =
        (Latin1TextConverter new byteToUnicode: $ü)

because

    Character>>#= aCharacter
        "Primitive. Answer true if the receiver and the argument are the same
        object (have the same object pointer) and false otherwise. Optional.
        See Object documentation whatIsAPrimitive."
        ^ self == aCharacter or: [
            aCharacter isCharacter and: [self asciiValue = aCharacter asciiValue]]

compares asciiValue, which is 252 for the Latin1TextConverter result and
71303420 (252 plus the leadingChar) for the ISO885915TextConverter result.
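The two numbers can be taken apart with plain integer arithmetic. This is only a sketch: it assumes the leadingChar is stored above the low 22 bits of the character value. That layout is an inference from the figures 252 and 71303420, not something stated in this thread.

```smalltalk
"Assumption: value = (leadingChar bitShift: 22) + code point.
 Evaluate each line with 'print it' in a workspace."
71303420 bitAnd: 16r3FFFFF.   "low 22 bits: the code point, 252"
71303420 bitShift: -22.       "remaining high bits: the leadingChar, 17"
(17 bitShift: 22) + 252.      "puts the two parts back together: 71303420"
252 = 71303420.               "false, so the two $ü compare as unequal"
```

Under that assumption both characters carry the same code point 252, and only the extra high bits make the comparison fail.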
So leadingChar can be considered harmful. It surely was useful in the 8-bit
days (which followed the 7-bit days), but I personally cannot see why the
encoding of a character should persist once the character has been created.
The leadingChar approach keeps 128 <= asciiValue <= 255 and adds the
leadingChar to identify the encoding (comparable to code pages). Another
approach would be a homogeneous encoding inside the image, so that the value
of a character is always Unicode (the 8-bit trouble is solved because 8 bits
are now Latin-1). There would only be Unicode characters without an encoding,
plus encoders that produce the individual encodings.
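As a rough illustration of that second approach, the Zn encoder from Sven's example already works this way: strings hold characters, and an encoder sits at each boundary. The sketch assumes that Zn code is loaded. decodeBytes: is taken from the example quoted above, while encodeString: is assumed to be its counterpart and may be named differently in your Zn version.

```smalltalk
| latin9 utf8 text |
latin9 := ZnCharacterEncoder newForEncoding: 'iso-8859-15'.
utf8 := ZnCharacterEncoder newForEncoding: 'utf-8'.

"Decoding: the bytes of 'Teilrückzahlung' in ISO-8859-15 become a String.
 No trace of the source encoding is meant to stay on the characters."
text := latin9 decodeBytes:
	#[84 101 105 108 114 252 99 107 122 97 104 108 117 110 103].

"Encoding: the same String is written out in whichever encoding is needed.
 encodeString: is an assumed selector, see the note above."
latin9 encodeString: text.
utf8 encodeString: text.
```

The point of the design is that the encoding is a property of the byte stream only, so two strings read from differently encoded files compare equal whenever they contain the same characters.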
The "westerners first" approach then leads to a situation where everyone not
living in the Latin-1 zone (or good ol' 7-bit US) has to live with wide
strings. This should be changed, but I can see that it is no easy task because
the leadingChar is used throughout the system to help with things like
collation. It would be a bigger effort to clean this up.
Norbert