On 30.03.2010, at 11:00, Nicolas Cellier wrote:

> 2010/3/30 Norbert Hartl <[email protected]>:
>>
>> On 29.03.2010, at 11:52, Nicolas Cellier wrote:
>>
>> 2010/3/29 Henrik Johansen <[email protected]>:
>>
>> On Mar 29, 2010, at 11:16 30AM, Nicolas Cellier wrote:
>>
>> I presume that under the idiom "latin1" you refer to code page 1252
>> rather than iso8859-L1, right ?
>>
>> Nicolas
>>
>> Good question :)
>>
>> What IS the presumed internal encoding of Bytestrings in Squeak?
>>
>> That's the one I meant, I merely assumed it was latin1 seeing as how the
>> text converter refers to it as such.
>>
>> Personally I thought it was iso8859-L1, seeing as the bytestring to unicode
>> conversion does a simple shift of chars > 127 to the 0080 - 00FF range.
>>
>> Cheers,
>>
>> Henry
>>
>>
>> From what I understood, CP1252 is Microsoft "latin1" and uses codes 128 to
>> 159.
>> ISO8859-L1 matches the first 256 codes of unicode latin-1 and has codes 128
>> to 159 unused.
>> You know, when Microsoft "uses" a standard, it's always a better standard ;)
>>
>> I have nothing against CP1252, it's an optimization which avoids
>> wasting 32 cheap codes.
>> But I'm not sure about various compatibility issues in/with the
>> external world...
>>
>> If you know how to easily assure that
>> (String with: (Character value: (Integer readFrom: '20AC' base: 16)))
>> = (String with: (Character value: (Integer readFrom: '80' base: 16)))
>> then you might be safe. By using Windows-1252 code points aren't unique
>> anymore. Every code point in the range 0x80 - 0x9F exists somewhere else,
>> too. So my estimation would be that it will cause more trouble than it might
>> solve.
>>
>
> Agree.
> I see two different problems here:
> 1) absence of explicit encoding information in external data
> 2) existence of a canonical representation which can be easily compared...
>
> Generalization of UTF8 should solve 1 (slowly with lot of inertia),
> then we can simply assume implicit=UTF8.
> Unicode could solve 2...
> ...Well, as long as diacriticals are ignored.
> To me Unicode still has problems with:
> (String with: 16r61 asCharacter with: 16r0302 asCharacter) = (String
> with: 16rE2 asCharacter)
>

Oh well, I forgot about this. There is little chance of getting this right without changing a lot of stuff. In my opinion Character has to go the way of SmallInteger. If the world is going to be Unicode-centric, then a character needs to be a sequence of code points. A character that has only one code point will be the special case that needs to be optimized, and it will resemble what Character is now. Even with those sequences you will still need a table that states the equality of a code point sequence and its 8-bit equivalent, e.g. â. But this is due to the Western-centric specification of Unicode, and we have to live with that.
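To make both comparisons concrete outside the image, here is a rough sketch in Python rather than Smalltalk. It assumes that Python's "cp1252" and "latin-1" codecs and its unicodedata module behave like the standards discussed in this thread; decode_byte and canonically_equal are hypothetical helper names made up for this illustration, not anything that exists in Squeak or Pharo.

```python
import unicodedata


def decode_byte(value, encoding):
    """Interpret a single byte under the given 8-bit encoding."""
    return bytes([value]).decode(encoding)


def canonically_equal(left, right):
    """Compare two strings after bringing both into NFC form."""
    return (unicodedata.normalize("NFC", left)
            == unicodedata.normalize("NFC", right))


euro = "\u20ac"

# Problem 1: the same byte names different characters per encoding.
# Under Windows-1252, 0x80 is the euro sign; under ISO 8859-1 it is
# the C1 control character U+0080.
euro_under_cp1252 = decode_byte(0x80, "cp1252") == euro    # True
euro_under_latin1 = decode_byte(0x80, "latin-1") == euro   # False

# Problem 2: 'a' + COMBINING CIRCUMFLEX ACCENT versus precomposed U+00E2.
decomposed = "\u0061\u0302"
precomposed = "\u00e2"
plain_equal = decomposed == precomposed                    # False
normalized_equal = canonically_equal(decomposed, precomposed)  # True
```

The point of the second half is that plain code point comparison fails, and equality only holds after both sides go through the same normalization form, which is the table lookup mentioned above in another guise.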
Another 2 cents,

Norbert

> Nicolas
>
>> Squeak clearly uses CP1252.
>> For Pharo, there might be a mix of the two since Sophie-like
>> refactorings. Surely what John was refering to.
>>
>> In pharo the 20AC string gives me a euro sign but the 80 hex one prints a
>> rectangle which is _a_ interpretation of '?' ;)
>> Norbert
>>

_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
