Re: [Pharo-project] XML Parser, Monticello and unicode?

Norbert Hartl Sat, 07 Aug 2010 04:09:06 -0700

On 06.08.2010, at 14:52, Henrik Johansen wrote:

> On Aug 5, 2010, at 5:37:34 PM, Norbert Hartl wrote:
> 
>> I'm trying to port the newest XML Parser from squeaksource to gemstone. In 
>> XML-Parser-AlexandreBergel.73 there is a unicode test introduced with a 
>> longer unicode xml snippet. From this release on I cannot load or merge 
>> anything. 
>> 
>> Besides the fact that the xml snippet looks very strange, it loads in pharo 
>> but not in gemstone. Unpacking the mcz on the console and examining the 
>> content showed that the encoding is indeed weird. I don't know what it is, 
>> but it is neither ascii nor utf-8. Pharo loaded the snippet into a 
>> WideString instance.
>> 
>> How does monticello handle WideString instances when written to a file? 
> 
> Rather randomly. ;)
> 
> If you merely need to export a package so you can import in gemstone, you 
> could change:
> 
> MCMczWriter >> addString: internalString at: path
>       | member utfConverter utfStringStream|
>       utfConverter := TextConverter newForEncoding: 'utf8'.
>       "(Or whatever other format Gemstone thinks .mcz definitions will be)"
>       utfStringStream := RWBinaryOrTextStream on: String new.
>       utfStringStream binary.
>       utfConverter class writeBOMOn: utfStringStream.
>       utfStringStream ascii.
>       utfConverter nextPutAll: internalString toStream: utfStringStream.
>       member := zip addString: utfStringStream contents asString as: path.
>       member desiredCompressionMethod: ZipArchive compressionDeflated 
> 
> (Alternatively use String new writeStream if you don't need/want to write 
> BOM).
> 
> Doing changes like this in the base image is unlikely without further 
> investigation, as it would probably break reading new packages (saved in 
> proper utf8) containing WideStrings into old images.
> I haven't read the import code, but if the binary format is preferred by old 
> images if available, it might be a reasonable compromise saving the source in 
> utf8, provided you also include the binary file.
> 
> Cheers,
> Henry
> 
> PS. Another fun fact I encountered when porting Assets:
> Monticello uses MethodReference>>source, which kindly converts all LF / CRLFs 
> in your source / strings in the source to CR.
> So you can forget f.ex. trying to save arbitrary ByteArrays as strings in 
> your code, and expect them to work the same when converting back to ByteArray 
> after saving to monticello :)
> 
Has this been reported before? If not, why not? This is really important. I 
don't think we can wait until Monticello is replaced by something different 
that will fix this :)
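
To make the problem from the PS concrete, here is a small sketch. It is written
in Python rather than Smalltalk, purely to show the effect on the bytes. The
function normalize_to_cr is a hypothetical stand-in for the line-ending
conversion Henry describes, not the actual code of MethodReference>>source.

```python
def normalize_to_cr(text: str) -> str:
    """Hypothetical stand-in: turn every CRLF and every lone LF into one CR."""
    return text.replace("\r\n", "\r").replace("\n", "\r")


# Arbitrary binary data that happens to contain 13 10 (CRLF) and 10 (LF).
original = bytes([1, 13, 10, 2, 10, 3])

# Keep the bytes in a one-byte-per-character string, as one would in source.
as_string = original.decode("latin-1")

# Simulate saving the source and converting the string back to bytes.
restored = normalize_to_cr(as_string).encode("latin-1")

print(list(original))  # [1, 13, 10, 2, 10, 3]
print(list(restored))  # [1, 13, 2, 13, 3]
```

The restored data is one byte shorter and has a 13 where a 10 used to be, so
the round trip is lossy for anything that is not plain text.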


There are a few things here that work together in 98% of all cases. I don't 
fully understand what is going on, but

ZipArchiveMember>>contentStream does
...
s := MultiByteBinaryOrTextStream on: (String new: self uncompressedSize).
s converter: Latin1TextConverter new.
...

and

MultiByteBinaryOrTextStream>>defaultConverter
        ^ Latin1TextConverter new.

Both of these are used when a monticello package is read, so the reading side 
assumes a latin1 encoding. On the other hand, something else in the system 
seems to make a similar assumption. I don't know InputEvents or how to debug 
them, but if I create a method

EncTest>>encTest
        ^ 'ö'

I can see that

((EncTest>>#encTest literalAt: 1) at: 1) asciiValue

is 246, which is the latin1 code for ö (latin1 coincides with the first 256 
Unicode code points).

So there seems to be a conversion to latin1 (I think at the time I press the 
key). In the code that writes a monticello package I didn't find any 
conversion, so this might be why the files end up as latin1 on disk and can be 
read back using an explicit conversion from latin1.
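
A small sketch of the bytes involved. It is Python, not Pharo, and only 
illustrates the encodings themselves. It makes no claim about what Monticello 
actually writes.

```python
text = "ö"

code_point = ord(text)                  # 246, same value as asciiValue above
latin1_bytes = text.encode("latin-1")   # one byte: F6
utf8_bytes = text.encode("utf-8")       # two bytes: C3 B6

# Latin-1 bytes read back through a Latin-1 converter round-trip cleanly.
round_trip = latin1_bytes.decode("latin-1")

# UTF-8 bytes read through a Latin-1 converter turn into two characters.
misread = utf8_bytes.decode("latin-1")

print(code_point, latin1_bytes.hex(), utf8_bytes.hex())
print(round_trip, misread)
```

This is why a reader that always assumes latin1 works as long as the writer 
also produced latin1, and shows garbage as soon as the file is real utf-8.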

But this does not explain how it works with WideString. I would need to dig 
deeper, but maybe one of you has an idea.

I think we should fix this. To estimate how feasible a change is, I scanned 
all of my cached monticello packages. Most of them are 7-bit clean, so changing 
the encoding is no problem for them. Besides XML Parser I didn't find any that 
contain a WideString, so no problem there either. Some of them are latin1 
encoded (like Seaside 2.8 or Seaside-InternetExplorer from 3.0). That is the 
biggest problem, because there is no fallback and monticello does not have a 
version number in its file format, right?
I think it is still feasible to change this in monticello, as the fix for users 
of older images will probably be only a few lines that can be applied to any 
version of monticello, if I'm not wrong. But the change is not that easy.
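
For anyone who wants to repeat the scan on their own package cache, this is 
roughly the kind of check meant above, sketched in Python. The function name 
classify and its labels are made up for illustration. It works on the raw 
bytes of an already unpacked source file and is only a heuristic: bytes that 
are not valid utf-8 are merely assumed to be latin1.

```python
def classify(data: bytes) -> str:
    """Roughly classify the encoding of unpacked package source bytes."""
    if all(b < 0x80 for b in data):
        return "7bit"            # safe under ascii, latin1 and utf-8 alike
    try:
        data.decode("utf-8")
        return "utf-8"           # high bytes form valid utf-8 sequences
    except UnicodeDecodeError:
        return "8bit-other"      # probably latin1, but that is a guess


print(classify(b"Object subclass: #Foo"))   # 7bit
print(classify("ö".encode("utf-8")))        # utf-8
print(classify("ö".encode("latin-1")))      # 8bit-other
```

Only the packages in the last group would need a fallback in the reader.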

Norbert
 
_______________________________________________
Pharo-project mailing list
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
