On Mar 29, 2010, at 11:10 AM, Henrik Johansen wrote:

>
> On Mar 28, 2010, at 4:36:13 PM, Stéphane Ducasse wrote:
>
>> Hi
>>
>> I'm trying to remember the situation with the internal representation of
>> strings in pharo/squeak
>> to revise
>> http://book.seaside.st/book/in-action/serving-files/character-encodings/seaside-pharo
>>
>> I saw that in pharo we have this NonASCIIMap. I do not remember what has
>> been done in pharo.
>> Argh memory leaks.... Nicolas do you remember the situation?
> NonASCIIMap is used for quickly determining whether a string has no
> character codes > 127 (i.e. only ASCII characters).
> It's very useful for doing a primitive-accelerated isAsciiString, which in the
> case of ascii-compatible encodings (utf8, latin1, macroman, etc.) would mean
> no conversion is required for it to be the "appropriate" internal bytestring
> format.
> It's used f.ex. in the nextChunk code,

ok thanks

> Strangely it is also used in FileStream writeSourceCodeFrom: baseName: isSt:,
> for some reason we use MacRoman there if the stream contents isAscii, which
> really makes no sense, but whatever.

ok maybe Levente fixed that in Squeak.

>
> John pointed out some converters were lying, I'm not entirely sure that's
> true anymore, what IS certain though, is the external code format used is
> inconsistent, depending on from where/how you save/load it.

Maybe we should write some tests to know what to fix.

> It really should be cleaned up to always store in utf8, and possibly also
> latin1 if possible.
> All this should be cleared up to always try reading as UTF8, then raising an
> InvalidUTF8 error which can be handled by telling it to use a different
> converter and restart.

ok

> Possibly chosen from a menu when dropping a file on image, or choosing an
> alternative automatically if we know the possible other encodings a file
> could have been saved as, not sure how to best do it for scripts given as
> parameters when launching the vm
>
> On the font rendering side, I agree with Nicolas it's too complicated doing
> font rendering in-image, FT is an ok compromise though.
> As for the bitmap strikefont rendering, what is really needed is a way to
> specify the charset it represents, and mappings from the internal string
> encodings to its glyphs.
> F.ex., Bitmap DejaVu is really latin15, so it will currently render some
> ByteString characters incorrectly, as well as render some Unicode chars it
> really has glyphs for as ?. (such as the euro sign)
>
> Which all really has nothing to do with your initial question :)

no problem I like to learn.

> The internal representation of strings really hasn't changed since it was
> written, with the exception that leadingChar for WideStrings is now zero.
> As far as I can tell, that means the internal storage format of widestrings is
> now equivalent to utf32, not sure what Byte Order it uses though, or if that
> is even consistent across platforms. :)
>
> The point about using WaKomEncoded, and passing all strings going into/out of
> the image through an encoder is still valid.
>
> Cheers,
> Henry
> _______________________________________________
> Pharo-project mailing list
> [email protected]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
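
For what it's worth, the "ASCII needs no conversion, otherwise try UTF8 first and restart with a different converter on failure" idea from the thread can be sketched roughly as below. This is only an illustration in Python, not Pharo code: `is_ascii`, `read_source` and the `fallback` parameter are hypothetical names, Python's `UnicodeDecodeError` stands in for the InvalidUTF8 error mentioned above, and latin-1 as the default fallback is just an assumption.

```python
def is_ascii(data):
    """True when no byte is > 127, i.e. the bytes are plain ASCII."""
    return all(b < 128 for b in data)


def read_source(data, fallback="latin-1"):
    """Decode bytes, returning (text, name_of_encoding_used).

    Pure ASCII is valid in every ASCII-compatible encoding, so it needs
    no real conversion. Otherwise UTF-8 is tried first, and an invalid
    UTF-8 sequence causes a restart with the fallback converter.
    """
    if is_ascii(data):
        return data.decode("ascii"), "ascii"
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Hypothetical policy: the caller picks the alternative converter.
        return data.decode(fallback), fallback
```

A caller that knows a file may have been saved as MacRoman could pass `fallback="mac_roman"` instead; how that choice gets made (menu, heuristics, command-line parameter) is exactly the open question in the thread.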
