2010/3/30 Norbert Hartl <[email protected]>:
>
> On 29.03.2010, at 11:52, Nicolas Cellier wrote:
>
> 2010/3/29 Henrik Johansen <[email protected]>:
>
> On Mar 29, 2010, at 11:16:30 AM, Nicolas Cellier wrote:
>
> I presume that under the idiom "latin1" you refer to code page 1252
> rather than iso8859-L1, right ?
>
> Nicolas
>
> Good question :)
>
> What IS the presumed internal encoding of Bytestrings in Squeak?
>
> That's the one I meant, I merely assumed it was latin1 seeing as how the
> text converter refers to it as such.
>
> Personally I thought it was iso8859-L1, seeing as the bytestring to unicode
> conversion does a simple shift of chars > 127 to the 0080 - 00FF range.
>
> Cheers,
>
> Henry
>
>
> From what I understood, CP1252 is Microsoft "latin1" and uses codes 128 to
> 159.
> ISO8859-L1 matches the first 256 codes of Unicode Latin-1 and leaves codes
> 128 to 159 unused.
> You know, when Microsoft "uses" a standard, it's always a better standard ;)
>
> I have nothing against CP1252, it's an optimization which avoids
> wasting 32 cheap codes.
> But I'm not sure about various compatibility issues in/with the
> external world...
>
> If you know how to easily assure that
> (String with: (Character value: (Integer readFrom: '20AC' base: 16)))
> = (String with: (Character value: (Integer readFrom: '80' base: 16)))
> then you might be safe. By using Windows-1252, code points aren't unique
> anymore. Every code point in the range 0x80 - 0x9F exists somewhere else,
> too. So my estimation would be that it will cause more trouble than it might
> solve.
>
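
For readers following the quoted comparison of '20AC' and '80', here is a minimal sketch of why the two strings differ unless byte 0x80 is read as CP1252. It is written in Python rather than Smalltalk, purely for illustration, and relies only on Python's standard "cp1252" and "iso-8859-1" codecs; it makes no claim about what Squeak or Pharo do internally.

```python
# Byte 0x80 under two "latin1" interpretations (illustration only).
raw = bytes([0x80])

as_cp1252 = raw.decode("cp1252")      # Windows-1252 maps 0x80 to the euro sign
as_latin1 = raw.decode("iso-8859-1")  # Python's codec maps 0x80 to U+0080

print(hex(ord(as_cp1252)))  # 0x20ac
print(hex(ord(as_latin1)))  # 0x80

# The equality from the thread only holds under the CP1252 reading.
euro = "\u20ac"
print(euro == as_cp1252)  # True
print(euro == as_latin1)  # False
```

In other words, the same stored byte yields two different characters depending on which table is assumed, which is the ambiguity discussed above.
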
Agree. I see two different problems here:
1) absence of explicit encoding information in external data
2) existence of a canonical representation which can be easily compared...

Generalization of UTF8 should solve 1 (slowly, with a lot of inertia),
then we can simply assume implicit=UTF8.
Unicode could solve 2...
...Well, as long as diacriticals are ignored. To me Unicode still has
problems with:
(String with: 16r61 asCharacter with: 16r0302 asCharacter)
= (String with: 16rE2 asCharacter)

Nicolas

> Squeak clearly uses CP1252.
> For Pharo, there might be a mix of the two since Sophie-like
> refactorings. Surely what John was referring to.
>
> In pharo the 20AC string gives me a euro sign but the 80 hex one prints a
> rectangle which is _an_ interpretation of '?' ;)
> Norbert

_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
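
As a side note on the diacritical comparison in the message (16r61 followed by 16r0302 versus 16rE2), the sketch below shows how the two spellings compare before and after Unicode normalization. It is in Python, for illustration only, using the standard unicodedata module; canonical_equal is a hypothetical helper name, and nothing here describes an existing Squeak or Pharo API.

```python
import unicodedata

decomposed = "\u0061\u0302"  # 'a' followed by COMBINING CIRCUMFLEX ACCENT
precomposed = "\u00e2"       # LATIN SMALL LETTER A WITH CIRCUMFLEX

# Compared code point by code point, the two strings differ.
print(decomposed == precomposed)  # False


def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# After normalizing both sides to the same form, they compare equal.
print(canonical_equal(decomposed, precomposed))  # True
```

So a plain comparison is only reliable once both strings have been brought to the same normalization form, which is an extra step on top of agreeing on the encoding.
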
