On 08.05.2012 at 14:00, Nicolas Cellier wrote:

> leadingChar was here merely for handling Han unification
> http://en.wikipedia.org/wiki/Han_unification.
> As I understand it, the language variant is encoded in leadingChar,
> while the generic ideogram is encoded in the unicode value.
> We tried to clean a lot already and avoid using leadingChar, except
> maybe for east-asian languages, so in case of umlaut, I would classify
> this as a bug.
>

Ok, so there is a real intention to get rid of leadingChar. Would it be good then to have Character>>#= use charCode instead of asciiValue?

Norbert

> 2012/5/8 Norbert Hartl:
>>
>> On 08.05.2012 at 11:41, Sven Van Caekenberghe wrote:
>>
>>> Hi Holger,
>>>
>>> --
>>> Sven Van Caekenberghe
>>> http://stfx.eu
>>> Smalltalk is the Red Pill
>>>
>>> On 08 May 2012, at 10:57, Holger Hans Peter Freyther wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to read a file from disk that is in latin1 encoding and then
>>>> try to compare the string with a string provided as literal and it fails.
>>>> My test case can be seen below. I am using Pharo 1.3 on a Linux machine
>>>> with a UTF-8 locale. Is there something obvious that I am doing wrong?
>>>>
>>>>
>>>> | stream text |
>>>> stream := (FileStream fileNamed: 'pharo_example_latin1.txt')
>>>>     converter: ISO885915TextConverter new;
>>>>     yourself.
>>>> text := stream contents.
>>>> text = 'Teilrückzahlung'
>>>>
>>>> <pharo_example_latin1.txt>
>>>
>>> This should work:
>>>
>>> | stream text |
>>> stream := (FileStream fileNamed:
>>>     '/Users/sven/Desktop/pharo_example_latin1.txt')
>>>     converter: Latin1TextConverter new;
>>>     yourself.
>>> text := stream contents.
>>> text = 'Teilrückzahlung'.
>>>
>>> or this (latest Zn code):
>>>
>>> | stream text |
>>> stream := (FileStream fileNamed:
>>>     '/Users/sven/Desktop/pharo_example_latin1.txt')
>>>     binary;
>>>     yourself.
>>> text := (ZnCharacterEncoder newForEncoding: 'iso-8859-15')
>>>     decodeBytes: stream contents.
>>> text = 'Teilrückzahlung'.
>>>
>>> There seems to be an issue with ISO885915TextConverter, check the umlaut
>>> encoding: it adds something called the leadingChar which I don't think is
>>> needed (I don't know why it even exists).
>>>
>> Yes, characters are mapped to characters with a leadingChar. Furthermore, in
>> Character the leadingChar value alters the output of asInteger, asciiValue,
>> codePoint... because they all use the value of the character that includes
>> leadingChar. But Latin1TextConverter doesn't do it, so you get false for
>>
>> (ISO885915TextConverter new byteToUnicode: $ü) = (Latin1TextConverter new
>> byteToUnicode: $ü)
>>
>> Because
>>
>> Character>>#= aCharacter
>>     "Primitive. Answer true if the receiver and the argument are the same
>>     object (have the same object pointer) and false otherwise. Optional.
>>     See Object documentation whatIsAPrimitive."
>>
>>     ^ self == aCharacter or: [
>>         aCharacter isCharacter and: [self asciiValue = aCharacter asciiValue]]
>>
>> is using asciiValue, which contains 252 in one case and 71303420 in the
>> other (252 plus the leadingChar in the high bits).
>>
>> So leadingChar can be considered harmful. It surely was useful in the 8bit
>> days (that followed the 7bit days). But I personally cannot see why the
>> encoding of a character should persist once the character has been created.
>> The leadingChar approach keeps the values in the range
>> 128 <= asciiValue <= 255 and adds the leadingChar to identify the encoding
>> (comparable to code pages). Another approach would be to have a homogeneous
>> encoding inside the image, so that the value of a character is always
>> unicode (the 8bit trouble is solved because 8bit now is latin1). There
>> would only be unicode characters without encoding, and encoders that
>> produce individual encodings.
>> The "westerners first" approach then leads to a situation where everyone
>> not living in the latin1 zone (or good ol' 7bit US) has to live with wide
>> strings.
>>
>> This should be changed, but I can see that's no easy task because the
>> leadingChar is used throughout the system trying to help with collation.
>> It would be a bigger effort to clean this up.
>>
>> Norbert
>>
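
The two values quoted in the thread (252 and 71303420) can be taken apart with plain integer arithmetic in a workspace. This is only a sketch: the 22-bit split used here is an assumption inferred from the numbers themselves (71303420 = 17 * 2^22 + 252), not something stated in the thread, and the variable names are made up for illustration.

```smalltalk
"Sketch: decompose the two values quoted in the thread.
 Assumption (not stated in the thread): the leadingChar sits in the bits
 above the low 22 bits of the character value."
| plain tagged charCodeMask |
plain := 252.
tagged := 71303420.
charCodeMask := (1 bitShift: 22) - 1.    "16r3FFFFF, the low 22 bits"

tagged bitShift: -22.                    "17, the leadingChar part"
tagged bitAnd: charCodeMask.             "252, the plain code for the umlaut"
(17 bitShift: 22) + 252 = tagged.        "true"

"Comparing the full values fails; comparing only the low 22 bits succeeds."
plain = tagged.                                                "false"
(plain bitAnd: charCodeMask) = (tagged bitAnd: charCodeMask).  "true"
```

Under that assumption, a comparison based only on the low bits (which is presumably what a charCode-based Character>>#= would amount to) treats both results as the same character, while the asciiValue-based comparison does not.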
