Re: [Pharo-project] squeakToUTF-8 and related?

Stéphane Ducasse Thu, 01 Apr 2010 01:33:51 -0700

On Mar 29, 2010, at 11:10 AM, Henrik Johansen wrote:

> 
> On Mar 28, 2010, at 4:36:13 PM, Stéphane Ducasse wrote:
> 
>> Hi 
>> 
>> I'm trying to remember the situation with the internal representation of 
>> strings in Pharo/Squeak, in order to revise 
>> http://book.seaside.st/book/in-action/serving-files/character-encodings/seaside-pharo
>> 
>> I saw that in Pharo we have this NonASCIIMap. I do not remember what has 
>> been done in Pharo. 
>> Argh, memory leaks.... Nicolas, do you remember the situation?
> NonASCIIMap is used for quickly determining whether a string has no 
> character codes > 127 (i.e. contains only ASCII characters).
> It's very useful for a primitive-accelerated isAsciiString, which in the 
> case of ASCII-compatible encodings (utf8, latin1, macroman, etc.) means 
> no conversion is required for the bytes to be in the "appropriate" internal 
> ByteString format. 
> It's used f.ex. in the nextChunk code.


ok thanks
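
For illustration, here is a minimal Python sketch of the table-lookup idea described above. It is not the actual Pharo code: `NON_ASCII_MAP` and `is_ascii_string` are hypothetical names, and the real implementation relies on a VM primitive rather than `bytes.translate`. The last lines check the point about ASCII-compatible encodings: pure-ASCII bytes decode to the same text under each of them.

```python
# Hypothetical 256-entry table: 1 marks a byte value above 127, 0 marks ASCII.
NON_ASCII_MAP = bytes(1 if code > 127 else 0 for code in range(256))


def is_ascii_string(data: bytes) -> bool:
    """Return True when no byte in data has a code above 127."""
    # translate() maps every byte through the table in one pass;
    # any 1 in the result means a non-ASCII byte was present.
    return 1 not in data.translate(NON_ASCII_MAP)


sample = b"Object subclass: #Foo"
if is_ascii_string(sample):
    # No conversion needed: every ASCII-compatible encoding agrees here.
    decoded = {enc: sample.decode(enc) for enc in ("utf-8", "latin-1", "mac_roman")}
    assert len(set(decoded.values())) == 1
```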

> Strangely, it is also used in FileStream writeSourceCodeFrom:baseName:isSt:. 
> For some reason we use MacRoman there if the stream contents are ASCII, 
> which really makes no sense, but whatever.

ok, maybe Levente fixed that in Squeak. 

> 
> John pointed out some converters were lying. I'm not entirely sure that's 
> still true; what IS certain, though, is that the external code format used 
> is inconsistent, depending on where and how you save/load it.

Maybe we should write some tests to know what to fix.

> It really should be cleaned up to always store in utf8, and possibly also 
> latin1 if possible.
> Reading should likewise be cleaned up to always try UTF8 first, raising an 
> InvalidUTF8 error on failure, which can be handled by telling it to use a 
> different converter and restart. 

ok
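
As a rough sketch of the "try UTF8, then restart with another converter" idea, here it is in Python rather than Smalltalk. `decode_source` and its `fallbacks` parameter are hypothetical names, and Python's `UnicodeDecodeError` stands in for the proposed InvalidUTF8 error:

```python
def decode_source(data: bytes, fallbacks=("latin-1",)):
    """Try UTF-8 first; on failure, retry with each fallback encoding in turn.

    Returns a (text, encoding_used) pair.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Stand-in for the proposed InvalidUTF8 error: the handler picks
        # another converter and restarts decoding from the beginning.
        for encoding in fallbacks:
            try:
                return data.decode(encoding), encoding
            except UnicodeDecodeError:
                continue
        # No fallback worked: re-raise the original UTF-8 failure.
        raise


text, used = decode_source(b"caf\xe9")  # a lone 0xE9 byte is not valid UTF-8
```

Note that latin-1 can decode any byte sequence, so with it as a fallback the read never fails; whether the result is the intended text is a separate question, which is why a menu or a known list of candidate encodings would still be needed.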

> The converter could be chosen from a menu when dropping a file on the 
> image, or an alternative could be picked automatically if we know the other 
> encodings a file could have been saved in. I'm not sure how best to do it 
> for scripts given as parameters when launching the VM.
> 
> On the font rendering side, I agree with Nicolas that doing font rendering 
> in-image is too complicated; FT is an ok compromise, though.
> As for the bitmap StrikeFont rendering, what is really needed is a way to 
> specify the charset it represents, and mappings from the internal string 
> encodings to its glyphs.
> F.ex., Bitmap DejaVu is really ISO-8859-15 (Latin-9), so it will currently 
> render some ByteString characters incorrectly, as well as render some 
> Unicode chars it really has glyphs for (such as the euro sign) as ?.
> 
> Which all really has nothing to do with your initial question :)

no problem I like to learn.
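
To make the euro-sign example concrete, a small Python sketch, assuming the bitmap font's charset is ISO-8859-15 and that a one-byte string is otherwise read as Latin-1. `FONT_CHARSET` and `glyph_index` are hypothetical names for the kind of charset declaration and mapping Henrik describes, not existing Pharo API:

```python
FONT_CHARSET = "iso8859-15"  # assumed charset of the bitmap font


def glyph_index(char: str):
    """Return the font slot for char, or None when the font has no glyph."""
    try:
        return char.encode(FONT_CHARSET)[0]
    except UnicodeEncodeError:
        return None


# The same byte names different characters in the two charsets.
as_latin1 = b"\xa4".decode("latin-1")     # U+00A4 CURRENCY SIGN
as_latin9 = b"\xa4".decode(FONT_CHARSET)  # U+20AC EURO SIGN

euro_slot = glyph_index("\u20ac")      # the euro sign lives in slot 0xA4
currency_slot = glyph_index("\u00a4")  # no glyph for the currency sign
```

Without such a mapping, a renderer that indexes the font directly by code point draws the euro glyph for U+00A4 and has no slot at all for U+20AC.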

> The internal representation of strings really hasn't changed since it was 
> written, with the exception that leadingChar for WideStrings is now zero. 
> As far as I can tell, that means the internal storage format of WideStrings 
> is now equivalent to utf32; I'm not sure what byte order it uses though, or 
> if that is even consistent across platforms. :)
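
The byte-order question can be illustrated with a short Python sketch. It only shows that storing one 32-bit word per code point matches UTF-32 in one of two byte orders; it makes no claim about which order the VM actually uses. `words_as_bytes` is a hypothetical helper:

```python
import struct
import sys


def words_as_bytes(text: str, order: str) -> bytes:
    """Pack each code point as one unsigned 32-bit word.

    order is a struct prefix: ">" big-endian, "<" little-endian, "=" native.
    """
    code_points = [ord(char) for char in text]
    return struct.pack(f"{order}{len(code_points)}I", *code_points)


sample = "a\u20ac"
big = words_as_bytes(sample, ">")
little = words_as_bytes(sample, "<")
native = words_as_bytes(sample, "=")

# One word per code point is exactly UTF-32, in the matching byte order.
assert big == sample.encode("utf-32-be")
assert little == sample.encode("utf-32-le")
# The native layout depends on the platform the code runs on.
assert native == (little if sys.byteorder == "little" else big)
```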
> 
> The point about using WAKomEncoded, and passing all strings going into/out 
> of the image through an encoder, is still valid.
> 
> Cheers,
> Henry
> _______________________________________________
> Pharo-project mailing list
> [email protected]
> http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project

