Hi, 

On Thu, 2014-11-20 at 16:47 +0100, Henrik Johansen wrote:
> > On 20 Nov 2014, at 2:39 , Jan Vrany <[email protected]> wrote:
> > 
> > 
> > But as I said, I'm more interested in 'low level' details like I
> > mentioned: 
> > 
> > - encoding of the source string
> >  
> > 
> > Best, Jan
> 
> IIRC, the .bin is the entire source string in DataStream format, that
> is, a datatype identifier (either ByteString or WideString),
> followed by the raw bytes/words (which is pure Latin-1 if ByteString,
> UTF-32BE(?) for WideString, at least since leadingChar of the standard
> Unicode locale was changed to 0). So writing an encoder/decoder
> strictly for use with MCZs isn't a very big task. (This is what
> GemStone does.)

That's what I do as well, but was not 100% sure about the encoding. 
Thanks! 
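
For the archive, a minimal Python sketch of the payload decoding as
described above. It assumes the datatype identifier and any length
prefix have already been parsed off; their exact byte layout is not
given in this thread, so the `kind` argument and the function name
`decode_payload` are hypothetical. The UTF-32BE choice for WideString
follows Henrik's tentative "(?)" and is not verified here.

```python
def decode_payload(kind: str, raw: bytes) -> str:
    """Decode the raw string payload of a .bin entry.

    `kind` is a hypothetical stand-in for the datatype identifier:
    "byte" for ByteString, "wide" for WideString.
    """
    if kind == "byte":
        # ByteString: one byte per character, Latin-1.
        return raw.decode("latin-1")
    if kind == "wide":
        # WideString: 32-bit words, assumed big-endian (UTF-32BE).
        return raw.decode("utf-32-be")
    raise ValueError(f"unknown string kind: {kind!r}")


if __name__ == "__main__":
    print(decode_payload("byte", b"caf\xe9"))
    print(decode_payload("wide", "\u03bb x".encode("utf-32-be")))
```

The encoder is the mirror image: `str.encode` with the same codec
names, after choosing "wide" whenever a character does not fit in
Latin-1.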

> 
> The pure text file (.sources) is only used as a fallback** when
> importing code where the .bin is corrupted/absent, it should either be
> pure latin-1, or UTF-8*.
> 

OK, thanks. I do not use the .sources file, only the .bin :-) But good
to know it should be UTF-8.

Thanks! 

> Cheers,
> Henry
> 
> * Not sure if it ended up being solved using a BOM marker for UTF-8 (as in the 
> .cs format), or if a UTF8Encoder is used by default, with a fallback to 
> Latin-1 if an invalid UTF-8 sequence is encountered. 
> ** Ironically, the string export was buggy up until recently, causing lots 
> of confusion when non-Latin-1 .mcz files exported/imported just fine in 
> Squeak/Pharo, but failed to import elsewhere (where the .bin reading was not 
> implemented).
> 
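
For anyone who does read the text fallback: a rough Python sketch that
covers both possibilities from the first footnote (a UTF-8 BOM, or
UTF-8 by default with a Latin-1 fallback). Which of the two
Squeak/Pharo actually uses is not settled in this thread, so treat the
combination below as an assumption; `decode_sources` is a made-up name.

```python
import codecs


def decode_sources(raw: bytes) -> str:
    """Decode the text fallback of an .mcz entry (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        # Explicit BOM: the content is declared UTF-8, so decode strictly.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        # No BOM: try UTF-8 first.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Invalid UTF-8: assume the file is plain Latin-1.
        return raw.decode("latin-1")


if __name__ == "__main__":
    print(decode_sources(b"caf\xc3\xa9"))  # valid UTF-8
    print(decode_sources(b"caf\xe9"))      # Latin-1 only
```

Note that the Latin-1 fallback never fails, since every byte maps to a
character, so a truncated or corrupted UTF-8 file will silently decode
to the wrong text rather than raise an error.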


