On 05.05.2013 at 10:07, Holger Hans Peter Freyther <hol...@freyther.de> wrote:

> Hi,
> 
> when I port a project to GNU Smalltalk I tend to take the snapshot/*.st
> and convert it. With some MCZ versions of Aida/Iliad this now fails
> because the fileout is broken: at some point the creator started to use
> UCS-4 (or similar), without a BOM, for the strings.
> 
> This can be seen here[1]. The import fails both when manually extracting
> the source and using FileStream>>#fileIn: and when using the MczInstaller
> on the mcz file (which does not work on the snapshot of the
> MCDefinition).
> 
> Is this a known/fixed problem with Monticello/Pharo? Is there a way to
> re-create the source.st from the snapshot of the MCDefinitions?
> 
Yes, the problem is known. Monticello has no handling for encodings. The last
time I looked into it, Monticello was assuming a latin1 encoding. As soon as
you include a non-latin1 character in the source, the string is turned into a
WideString. When this is written to disk, either our UCS-4+leadingChar format
is written or, even worse, every byte of the WideString is latin1 encoded. I'm
not sure which, but either way it isn't the right way to do it.
I started to fix this a couple of years ago but, as is the case most of the
time, the problem is deeply embedded in the image and grows the longer you look
at it. That massively exceeds the time I have for these things.
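
To make the failure mode above concrete, here is a minimal sketch. It is
written in Python only because that makes the bytes easy to inspect; it is not
Monticello code. The sample string is made up, and plain big-endian UTF-32
stands in for our UCS-4+leadingChar format (the leadingChar bits are not
modelled).

```python
# Sketch only: Python stands in for the image; the sample source is made up.
source = "price := '10 \u20ac'"  # contains EURO SIGN, which is not in latin1


def is_latin1(text):
    """True if every character fits into a single latin1 byte (0..255)."""
    return all(ord(ch) <= 0xFF for ch in text)


# Failure mode: four bytes per character are written (UTF-32 is used here as
# an approximation of UCS-4), and a reader that assumes latin1 decodes them.
ucs4_bytes = source.encode("utf-32-be")
misread = ucs4_bytes.decode("latin-1")

# What a canonical encoding would give instead: utf-8 round-trips cleanly.
utf8_bytes = source.encode("utf-8")
roundtrip = utf8_bytes.decode("utf-8")

print("latin1 only:", is_latin1(source))
print("chars:", len(source), "ucs4 bytes:", len(ucs4_bytes),
      "utf-8 bytes:", len(utf8_bytes))
print("misread contains NUL characters:", "\x00" in misread)
print("utf-8 round trip intact:", roundtrip == source)
```

A reader that assumes latin1 sees four "characters" per original character,
most of them NUL, which is roughly what a broken fileout looks like.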

The problem is easier to fix for the .st file because string literals ('') and
uses of the String class are platform independent. In the binary blob,
platform-specific classes like WideString appear, which make it unreadable on
other platforms. A canonical way of encoding everything in utf-8 would also
mean treating platform-dependent classes canonically, i.e. using only String
instead of the platform-dependent ones. A platform that reads a Monticello file
would get utf-8, decode it, and on encountering a wide character turn the
string into its own platform-dependent class, etc.

So, in order to "fix" this, I think the only feasible way is to get rid of the
non-latin1 characters in the source and save the package again. This is how it
is done in Seaside, for example. If someone really needs non-latin1 characters,
they should be included programmatically, i.e. by using "Character value: …"
at the right position in the code.
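
As a rough aid for that cleanup, the following sketch lists the non-latin1
characters in a piece of source together with the "Character value:"
expression that could replace each one. It is again Python and not part of
Monticello; the function name find_non_latin1 and the sample method are
hypothetical.

```python
def find_non_latin1(source):
    """Return (line, column, char, suggestion) for each character above 255."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0xFF:
                # Suggest the programmatic form for use in the Smalltalk code.
                hits.append((line_no, col, ch, f"Character value: {ord(ch)}"))
    return hits


# Made-up method source containing one non-latin1 character (EURO SIGN).
sample = "label\n    ^ 'Preis in \u20ac'\n"

for line_no, col, ch, suggestion in find_non_latin1(sample):
    print(f"line {line_no}, column {col}: U+{ord(ch):04X} -> {suggestion}")
```

The script only reports positions. Rewriting the string literal so that the
character is built at runtime is still a manual step.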

Norbert

