============ Forwarded message ============
From : jaayer<[email protected]>
To : <[email protected]>
Date : Tue, 18 May 2010 16:30:06 -0700
Subject : Re: Decoding bug with XMLParser ?
============ Forwarded message ============

---- On Tue, 18 May 2010 02:29:18 -0700 Alexandre Bergel <[email protected]> wrote ----

>To give a bit of context, the problem is:
>
>-=-=-=-=-=-=-=-=-=-=-=-=
>exampleEncodedXML
>	^'<?xml version="1.0" encoding="UTF-8"?>
><test-data>&#8230;</test-data>
>'
>
>testDecodingCharacters
>	| xmlDocument element |
>	"XMLTokenizer testDecodingCharacters"
>
>	xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream.
>	element := xmlDocument firstTagNamed: #'test-data'.
>
>	self assert: element contentString first codePoint = 8230
>-=-=-=-=-=-=-=-=-=-=-=-=
>
>#testDecodingCharacters goes yellow
>
>> Thinking of it, it's not really an encoding problem, rather a bug in
>> the entity->character conversion. I guess there should be a similar
>> test where there is an actual ellipsis character in the xml, instead
>> of the entity.
>
>Any idea how your test can go green?
>
>> And now I realize our server will not be able to connect outside its
>> DMZ, so I won't be able to use the fix :D
>
>DMZ ?
>
>Cheers,
>Alexandre
>

Character references like the one above are handled by #nextCharReference. It reads the number after the "&#" or "&#x" prefix and then sends #value: to the class Unicode with that number as the argument. If you evaluate "(Unicode value: 8230) codePoint" in a workspace with cmd-p, you will see that the resulting code point is not what you would expect. For me it was 1069555750. The same behavior results when creating a Unicode character with #charFromUnicode:. Unless Unicode>>value: and Unicode>>charFromUnicode: are being used incorrectly, I am not sure that this is a bug, or at least a bug in XML-Support.

(I am working on adding full DTD support with validation, and on refactoring and re-engineering the parser, which is why minor releases have slowed to a trickle. I will take a closer look at how character encoding is handled in the process.)
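
One possible reading of the number reported above, offered as a sketch and not as a diagnosis: 1069555750 splits cleanly into the expected code point 8230 in its low 22 bits and 255 in the bits above them. The arithmetic below is plain integer math and can be evaluated in any workspace. That the upper bits are a language tag packed into the character value by Unicode value: is an assumption about how this image encodes characters, not something verified here, and the variable names are purely illustrative.

```smalltalk
"Decompose the value printed by (Unicode value: 8230) codePoint."
| observed lowBits highBits |
observed := 1069555750.

"Low 22 bits: the code point the test expects."
lowBits := observed bitAnd: 16r3FFFFF.          "8230"

"Remaining high bits: ASSUMED to be a leadingChar-style language tag."
highBits := observed bitShift: -22.             "255"

"Recombining the two parts gives back the observed value."
(highBits bitShift: 22) + lowBits = observed.   "true"
```

If that assumption holds, the character reference is being converted to the right code point and the test fails only because it compares against the full tagged value instead of the low 22 bits.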
_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
