On Mar 28, 2010, at 4:36:13 PM, Stéphane Ducasse wrote:

> Hi
>
> I'm trying to remember the situation with the internal representation of
> string in pharo/squeak
> to revise
> http://book.seaside.st/book/in-action/serving-files/character-encodings/seaside-pharo
>
> I saw that in pharo we have this NonASCIIMap. I do not remember what have
> been done in pharo.
> Argh memory leaks.... Nicolas do you remember the situation?

NonASCIIMap is used for quickly determining whether a string contains no character codes > 127 (i.e. only ASCII characters). It's very useful for a primitive-accelerated isAsciiString, which for ASCII-compatible encodings (utf8, latin1, macroman, etc.) means no conversion is required: the bytes are already in the "appropriate" internal ByteString format.

It's used, for example, in the nextChunk code. Strangely, it is also used in FileStream writeSourceCodeFrom:baseName:isSt:, where for some reason we use MacRoman if the stream contents isAscii, which really makes no sense, but whatever.

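To make the ASCII shortcut concrete, here is a minimal sketch in Python rather than Smalltalk. The names `NON_ASCII_MAP`, `find_first_in_set` and `is_ascii_bytes` are hypothetical stand-ins, and the 256-entry table layout is my assumption about how such a map would look, not a copy of the image code. It only shows why a string with no codes > 127 needs no conversion between ASCII-compatible encodings.

```python
# Hypothetical stand-in for NonASCIIMap: one flag per byte value,
# set to 1 for every value > 127 (assumed layout, not the image's code).
NON_ASCII_MAP = bytes(1 if code > 127 else 0 for code in range(256))


def find_first_in_set(data: bytes, inclusion_map: bytes, start: int = 0) -> int:
    """Return the index of the first byte flagged in inclusion_map, or -1."""
    for index in range(start, len(data)):
        if inclusion_map[data[index]]:
            return index
    return -1


def is_ascii_bytes(data: bytes) -> bool:
    """True when no byte in data has a value above 127."""
    return find_first_in_set(data, NON_ASCII_MAP) == -1


# Pure ASCII text has identical bytes in all three encodings,
# so no conversion step is needed.
plain = "plain ascii"
assert is_ascii_bytes(plain.encode("utf-8"))
assert plain.encode("utf-8") == plain.encode("latin-1") == plain.encode("mac_roman")

# A single accented character already makes the encodings disagree.
accented = "é"
assert not is_ascii_bytes(accented.encode("latin-1"))
assert accented.encode("utf-8") != accented.encode("latin-1") != accented.encode("mac_roman")
```

The table lookup mirrors the "scan for the first character in a set" style of check; in Python alone, `bytes.isascii()` would do the same job.
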
John pointed out that some converters were lying. I'm not entirely sure that's true anymore; what IS certain, though, is that the external code format used is inconsistent, depending on where and how you save/load it. It really should be cleaned up to always store in utf8, and possibly also latin1 if possible.

All this should be cleared up to always try reading as UTF8 first, then raise an InvalidUTF8 error which can be handled by telling it to use a different converter and restart. The alternative could be chosen from a menu when dropping a file on the image, or picked automatically if we know the other encodings a file could have been saved as. I'm not sure how best to do it for scripts given as parameters when launching the VM.

On the font rendering side, I agree with Nicolas that it's too complicated to do font rendering in-image; FT is an ok compromise though. As for the bitmap StrikeFont rendering, what is really needed is a way to specify the charset a font represents, and mappings from the internal string encodings to its glyphs. For example, Bitmap DejaVu is really latin15, so it will currently render some ByteString characters incorrectly, as well as render some Unicode chars it really has glyphs for (such as the euro sign) as ?.

Which all really has nothing to do with your initial question :)

The internal representation of strings really hasn't changed since it was written, with the exception that leadingChar for WideStrings is now zero. As far as I can tell, that means the internal storage format of WideStrings is now equivalent to utf32; I'm not sure what byte order it uses though, or if that is even consistent across platforms. :)

The point about using WAKomEncoded, and passing all strings going into/out of the image through an encoder, is still valid.

Cheers,
Henry
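P.S. A rough sketch of the "try UTF8 first, fall back on error" idea, in Python for illustration only. `decode_source` and its `fallbacks` parameter are hypothetical names, and Python's `UnicodeDecodeError` stands in for the InvalidUTF8 error mentioned above; it is not how the image does it today.

```python
def decode_source(data: bytes, fallbacks=("latin-1",)):
    """Decode data as UTF-8, retrying with each fallback encoding on failure.

    Returns a (text, encoding_used) pair. The fallback list is an assumption:
    it represents the other encodings the file might have been saved in.
    """
    for encoding in ("utf-8", *fallbacks):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # Invalid for this encoding: restart with the next candidate.
            continue
    raise ValueError("no candidate encoding could decode the data")


# Valid UTF-8 is accepted on the first attempt.
text, used = decode_source("héllo".encode("utf-8"))
assert (text, used) == ("héllo", "utf-8")

# Latin-1 bytes are not valid UTF-8 here, so the fallback is used.
text, used = decode_source("héllo".encode("latin-1"))
assert (text, used) == ("héllo", "latin-1")
```

Note that a single-byte fallback such as latin-1 accepts any byte sequence, so the order of the candidates matters: the strict encoding has to be tried first.
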
