leadingChar was there merely to handle Han unification (http://en.wikipedia.org/wiki/Han_unification). As I understand it, the language variant is encoded in leadingChar, while the generic ideogram is encoded in the Unicode value. We have already cleaned up a lot and try to avoid using leadingChar, except maybe for East Asian languages, so in the case of an umlaut I would classify this as a bug.
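
To make the split between leadingChar and the Unicode value concrete, here is a small Python sketch (not Pharo code). It assumes the character value packs leadingChar into the bits above a 22-bit code point; that layout is my assumption, though it is consistent with the value 71303420 that Norbert reports for ü in the quoted message. The names split_value and pack_value are made up for illustration.

```python
# Assumption: a character value stores leadingChar above a 22-bit code point.
CODE_BITS = 22
CODE_MASK = (1 << CODE_BITS) - 1


def split_value(value):
    """Return (leading_char, code_point) for a packed character value."""
    return value >> CODE_BITS, value & CODE_MASK


def pack_value(leading_char, code_point):
    """Combine a leadingChar and a code point into one integer value."""
    return (leading_char << CODE_BITS) | code_point


# The value reported for ü after going through ISO885915TextConverter.
leading, code = split_value(71303420)
print(leading, code, chr(code))

# With leadingChar 0 the value is just the code point, which is what
# Latin1TextConverter produces, so the two values compare as different.
print(pack_value(0, code) == 71303420)
```

Under that assumption the same code point 252 sits in both values, and only the leadingChar bits make the comparison fail.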
Nicolas

2012/5/8 Norbert Hartl:
>
> On 08.05.2012 at 11:41, Sven Van Caekenberghe wrote:
>
>> Hi Holger,
>>
>> --
>> Sven Van Caekenberghe
>> http://stfx.eu
>> Smalltalk is the Red Pill
>>
>> On 08 May 2012, at 10:57, Holger Hans Peter Freyther wrote:
>>
>>> Hi,
>>>
>>> I am trying to read a file from disk that is in latin1 encoding and then try
>>> to compare the string with a string provided as literal and it fails. My test
>>> case can be seen below. I am using Pharo 1.3 on a Linux machine with a UTF-8
>>> locale. Is there something obvious that I am doing wrong?
>>>
>>> | stream text |
>>> stream := (FileStream fileNamed: 'pharo_example_latin1.txt')
>>>     converter: ISO885915TextConverter new;
>>>     yourself.
>>> text := stream contents.
>>> text = 'Teilrückzahlung'
>>>
>>> <pharo_example_latin1.txt>
>>
>> This should work:
>>
>> | stream text |
>> stream := (FileStream fileNamed: '/Users/sven/Desktop/pharo_example_latin1.txt')
>>     converter: Latin1TextConverter new;
>>     yourself.
>> text := stream contents.
>> text = 'Teilrückzahlung'.
>>
>> or this (latest Zn code):
>>
>> | stream text |
>> stream := (FileStream fileNamed: '/Users/sven/Desktop/pharo_example_latin1.txt')
>>     binary;
>>     yourself.
>> text := (ZnCharacterEncoder newForEncoding: 'iso-8859-15')
>>     decodeBytes: stream contents.
>> text = 'Teilrückzahlung'.
>>
>> there seems to be an issue with ISO885915TextConverter, check the umlaut
>> encoding: it adds something called the leadingChar which I don't think is
>> needed (I don't know why it even exists).
>>
> Yes, characters are mapped to characters with a leadingChar. Furthermore, in
> Character the leadingChar value alters the output of asInteger, asciiValue,
> codePoint... because they all use the value of the character, which includes
> the leadingChar. But Latin1TextConverter doesn't do that, so you get false for
>
> (ISO885915TextConverter new byteToUnicode: $ü) = (Latin1TextConverter new
>     byteToUnicode: $ü)
>
> Because
>
> Character>>#= aCharacter
>     "Primitive. Answer true if the receiver and the argument are the same
>     object (have the same object pointer) and false otherwise. Optional. See
>     Object documentation whatIsAPrimitive."
>
>     ^ self == aCharacter or: [
>         aCharacter isCharacter and: [self asciiValue = aCharacter asciiValue]]
>
> is using asciiValue, which contains 252 in one case and 71303420 (252 +
> leadingChar) in the other.
>
> So leadingChar can be considered harmful. It surely was useful in the 8bit
> days (that followed the 7bit days). But I personally cannot see why the
> encoding of a character should persist once the character has been created.
> The approach with leadingChar keeps 128 <= asciiValue <= 255 and adds the
> leadingChar for the encoding (comparable to code pages). Another approach
> would be to have a homogeneous encoding inside the image, so that the value
> of a character is always Unicode (the 8bit trouble is solved because 8bit now
> is latin1). There would only be Unicode characters without encoding, and
> encoders that produce individual encodings.
> The "Westerners first" approach then leads to a situation where everyone not
> living in the latin1 zone (or good ol' 7bit US) has to live with wide strings.
>
> This should be changed, but I can see that it is no easy task because the
> leadingChar is used throughout the system to help with collation.
> It would be a bigger effort to clean this up.
>
> Norbert
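
As an illustration of the "always Unicode inside, encoders at the boundary" approach Norbert describes, here is a small Python sketch (again not Pharo code). It uses Python's standard latin-1 and iso8859-15 codecs as stand-ins for the two converters; the byte string is my own reconstruction of what the test file presumably contains, not the actual attachment.

```python
# Assumed file content: 'Teilrückzahlung' with ü stored as the single byte 0xFC.
raw = b"Teilr\xfcckzahlung"

# Both decoders map byte 0xFC to the same Unicode code point U+00FC,
# so the decoded strings are equal to each other and to the literal.
from_latin1 = raw.decode("latin-1")
from_8859_15 = raw.decode("iso8859-15")
print(from_latin1 == from_8859_15 == "Teilrückzahlung")
print(ord(from_latin1[5]), ord(from_8859_15[5]))

# The two encodings differ only at a few byte values, for example 0xA4,
# which is the currency sign in Latin-1 and the euro sign in ISO-8859-15.
print(b"\xa4".decode("latin-1"), b"\xa4".decode("iso8859-15"))
```

Since no encoding tag is carried inside the decoded character, comparison works on the code point alone, which is the behaviour Holger's test case expects.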
