Hi,

On Thu, 2014-11-20 at 16:47 +0100, Henrik Johansen wrote:
> > On 20 Nov 2014, at 2:39 , Jan Vrany <[email protected]> wrote:
> >
> > But as I said, I'm more interested in 'low level' details like I
> > mentioned:
> >
> > - encoding of the source string
> >
> > Best, Jan
>
> IIRC, the .bin is the entire source string in Datastream-format, that
> is, a datatype identifier (either ByteString or WideString),
> followed by the raw bytes/words (which is pure Latin1 if ByteString,
> UTF32-BE(?) for WideString, at least since leadingChar of the standard
> Unicode locale was changed to 0). So writing an encoder/decoder
> strictly for use with MCZ's isn't a very big task. (this is what
> gemstone does)
That's what I do as well, but was not 100% sure about the encoding.
Thanks!

> The pure text file (.sources) is only used as a fallback** when
> importing code where the .bin is corrupted/absent, it should either be
> pure latin-1, or UTF-8*.
>

OK, thanks. I do not use .sources file, only .bin :-)
But good to know it should be UTF8

Thanks!

> Cheers,
> Henry
>
> * Not sure if it ended up being solved using a BOM-marker for UTF8 (as in the
> .cs format), or if a UTF8Encoder is used by default, with a fallback to
> latin1 if an incorrect utf8 character is encountered.
> ** Ironically, the string export was bugged up until recently, causing lots
> of confusion when non-latin1 .mcz exported/imported just fine in
> Squeak/Pharo, but failed to import elsewhere (where the .bin reading was not
> implemented)
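
For anyone writing a reader outside Squeak/Pharo, the two decoding rules discussed above can be sketched roughly as follows. This is only an illustration in Python, not the code used by Monticello or GemStone. The function names `decode_bin_payload` and `decode_sources` are hypothetical, and the sketch assumes the caller has already stripped the Datastream type identifier (and any length header) and knows whether the string was a ByteString or a WideString. The big-endian UTF-32 reading of WideString is the tentative guess from the thread, not a confirmed fact.

```python
def decode_bin_payload(raw: bytes, wide: bool) -> str:
    """Decode the character data of a .bin source string.

    `raw` is assumed to be the payload only; parsing the Datastream
    header is out of scope for this sketch.
    """
    if wide:
        # Assumption from the thread: WideString words are big-endian
        # UTF-32 (valid once leadingChar is 0).
        return raw.decode("utf-32-be")
    # ByteString: one byte per character, pure Latin-1.
    return raw.decode("latin-1")


def decode_sources(raw: bytes) -> str:
    """Decode the plain-text fallback file.

    Covers both possibilities mentioned in the footnote: a UTF-8 BOM is
    dropped if present, and invalid UTF-8 falls back to Latin-1.
    """
    try:
        # "utf-8-sig" decodes UTF-8 and strips a leading BOM if there is one.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("latin-1")
```

Note that the Latin-1 fallback never raises, so a file that is neither valid UTF-8 nor really Latin-1 will still decode, just to the wrong characters.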
