Re: [Pharo-project] squeakToUTF-8 and related?

Henrik Johansen Mon, 29 Mar 2010 02:11:16 -0700

On Mar 28, 2010, at 4:36:13 PM, Stéphane Ducasse wrote:

> Hi 
> 
> I'm trying to remember the situation with the internal representation of 
> string in pharo/squeak
> to revise 
> http://book.seaside.st/book/in-action/serving-files/character-encodings/seaside-pharo
> 
> I saw that in Pharo we have this NonASCIIMap. I do not remember what has 
> been done in Pharo. 
> Argh, memory leaks.... Nicolas, do you remember the situation?
NonASCIIMap is used for quickly determining whether a string contains no 
character codes > 127 (i.e. only ASCII characters).
It's very useful for doing a primitive-accelerated isAsciiString, which in the 
case of ASCII-compatible encodings (utf8, latin1, macroman, etc.) means no 
conversion is required for the string to be in the "appropriate" internal 
ByteString format. 
It's used, for example, in the nextChunk code.
Strangely, it is also used in FileStream writeSourceCodeFrom:baseName:isSt:, 
where for some reason we use MacRoman if the stream contents are ASCII, which 
really makes no sense, but whatever.
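
A minimal sketch of why the ASCII check lets you skip conversion, written in 
Python rather than Smalltalk. The function names and the sample list of 
encodings are made up for illustration; this is not the NonASCIIMap 
implementation. It only shows that pure-ASCII bytes decode to the same text 
under every ASCII-compatible encoding, while a single byte > 127 makes the 
results diverge.

```python
# Sketch only: a Python stand-in for the image-side isAsciiString check.
# ASCII_COMPATIBLE is an assumed sample, not an exhaustive list.
ASCII_COMPATIBLE = ("utf-8", "latin-1", "mac_roman")


def is_ascii_bytes(data: bytes) -> bool:
    """True when no byte is > 127, i.e. the data is pure ASCII."""
    return all(b < 128 for b in data)


def decodings(data: bytes) -> set:
    """Decode under each encoding and return the set of distinct results."""
    results = set()
    for enc in ASCII_COMPATIBLE:
        try:
            results.add(data.decode(enc))
        except UnicodeDecodeError:
            results.add(None)  # this encoding rejects the bytes outright
    return results


plain = b"Object subclass: #Foo"
accented = "St\u00e9phane".encode("latin-1")  # contains the byte 0xE9

# Pure ASCII: one and the same text whichever converter is used.
print(is_ascii_bytes(plain), len(decodings(plain)))
# One non-ASCII byte: every converter disagrees (or fails).
print(is_ascii_bytes(accented), len(decodings(accented)))
```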


John pointed out that some converters were lying. I'm not entirely sure that's 
still true; what IS certain, though, is that the external code format used is 
inconsistent, depending on where and how you save/load it.
It really should be cleaned up to always store in utf8, and possibly latin1 as 
well.
Reading should be cleaned up to always try UTF8 first, raising an InvalidUTF8 
error on failure, which can be handled by telling it to use a different 
converter and restart. 
The converter could be chosen from a menu when dropping a file on the image, 
or picked automatically if we know the other encodings a file could have been 
saved as. I'm not sure how best to do it for scripts given as parameters when 
launching the VM.
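
The "try UTF8, raise, pick another converter, restart" idea could look roughly 
like this. It is a Python sketch under stated assumptions, not existing Pharo 
code: the InvalidUTF8 class here is a hypothetical stand-in for the proposed 
error, and choose_encoding is a made-up callback playing the role of the menu 
or the automatic choice.

```python
class InvalidUTF8(Exception):
    """Hypothetical stand-in for the proposed InvalidUTF8 error."""

    def __init__(self, data: bytes):
        super().__init__("data is not valid UTF-8")
        self.data = data  # keep the raw bytes so the handler can retry


def read_utf8(data: bytes) -> str:
    """Always try UTF-8 first; signal InvalidUTF8 if that fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        raise InvalidUTF8(data) from err


def read_with_fallback(data: bytes, choose_encoding) -> str:
    """Handle InvalidUTF8 by asking for another converter and restarting.

    choose_encoding is an assumed callback (a menu, or a guess based on the
    encodings the file could have been saved in) returning a codec name.
    """
    try:
        return read_utf8(data)
    except InvalidUTF8 as err:
        return err.data.decode(choose_encoding(err.data))


# A file saved as latin1 fails the UTF-8 attempt and is re-read as latin1.
legacy = "St\u00e9phane".encode("latin-1")
print(read_with_fallback(legacy, lambda raw: "latin-1"))
```

For scripts passed on the command line there is nobody to ask, so the callback 
would have to be a fixed default; which default is the open question.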

On the font rendering side, I agree with Nicolas that doing font rendering 
in-image is too complicated; FT (FreeType) is an ok compromise, though.
As for the bitmap StrikeFont rendering, what is really needed is a way to 
specify the charset a font represents, and mappings from the internal string 
encodings to its glyphs.
For example, Bitmap DejaVu is really latin15 (i.e. ISO 8859-15), so it will 
currently render some ByteString characters incorrectly, and render as ? some 
Unicode chars it really has glyphs for (such as the euro sign).
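
To make the glyph mapping problem concrete, here is a small Python sketch. It 
assumes "latin15" means ISO 8859-15 and that the font's glyph slots follow that 
charset; both function names are invented for illustration and do not 
correspond to StrikeFont methods.

```python
# Assumption: the bitmap font's 256 glyph slots follow ISO 8859-15.
FONT_CHARSET = "iso8859-15"


def naive_glyph_index(char: str):
    """No mapping: use the Unicode code point directly as the glyph slot."""
    code_point = ord(char)
    return code_point if code_point < 256 else None  # None -> drawn as '?'


def mapped_glyph_index(char: str):
    """With a mapping: go from the code point through the font's charset."""
    try:
        return char.encode(FONT_CHARSET)[0]
    except UnicodeEncodeError:
        return None  # the font really has no glyph for this character


euro = "\u20ac"      # EURO SIGN, slot 0xA4 in ISO 8859-15
currency = "\u00a4"  # CURRENCY SIGN, code point 0xA4, absent from ISO 8859-15

# Naive lookup: the euro sign is out of range although the font has the glyph,
# and the currency sign lands on the euro glyph's slot.
print(naive_glyph_index(euro), hex(naive_glyph_index(currency)))
# Mapped lookup: the euro sign finds its slot, the currency sign is reported missing.
print(hex(mapped_glyph_index(euro)), mapped_glyph_index(currency))
```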

Which all really has nothing to do with your initial question :)
The internal representation of strings really hasn't changed since it was 
written, with the exception that leadingChar for WideStrings is now zero. 
As far as I can tell, that means the internal storage format of WideStrings is 
now equivalent to utf32. I'm not sure what byte order it uses, though, or if 
that is even consistent across platforms. :)
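
The byte order question only matters once the 32-bit values leave the image as 
raw bytes. A short Python illustration of the two possible layouts follows; it 
says nothing about which one a given VM actually uses.

```python
import struct

text = "a\u20ac"  # 'a' followed by the euro sign

# With a zero leadingChar, each element is just the Unicode code point.
code_points = [ord(c) for c in text]  # [0x61, 0x20AC]

# The same code points serialised as UTF-32 in both byte orders.
little = text.encode("utf-32-le")
big = text.encode("utf-32-be")

print(little.hex(" ", 4))
print(big.hex(" ", 4))

# Reading the bytes back with the matching byte order recovers the code points.
print(struct.unpack("<2I", little) == tuple(code_points))
print(struct.unpack(">2I", big) == tuple(code_points))
```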

The point about using WAKomEncoded, and passing all strings going into/out of 
the image through an encoder, is still valid.

Cheers,
Henry
_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
