============ Forwarded message ============
From : jaayer<[email protected]>
To :  <[email protected]>
Date : Tue, 18 May 2010 16:30:06 -0700
Subject : Re: Decoding bug with XMLParser ?
============ Forwarded message ============

---- On Tue, 18 May 2010 02:29:18 -0700 Alexandre Bergel 
<[email protected]> wrote ---- 

>To give a bit of context, the problem is: 
> 
>-=-=-=-=-=-=-=-=-=-=-=-= 
>exampleEncodedXML 
>    ^'<?xml version="1.0" encoding="UTF-8"?> 
><test-data>&#8230;</test-data> 
>' 
> 
>testDecodingCharacters 
>    | xmlDocument element | 
>    "XMLTokenizer testDecodingCharacters" 
> 
>    xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream. 
>    element := xmlDocument firstTagNamed: #'test-data'. 
>     
>    self assert: element contentString first codePoint = 8230 
>-=-=-=-=-=-=-=-=-=-=-=-= 
> 
>#testDecodingCharacters goes yellow 
> 
>> Thinking of it, it's not really an encoding problem, rather a bug in 
>> the entity->character conversion. I guess there should be a similar 
>> test where there is an actual ellipsis character in the xml, instead 
>> of the entity. 
> 
>Any idea how your test can go green? 
> 
>> And now I realize our server will not be able to connect outside its 
>> DMZ, so I won't be able to use the fix :D 
> 
>DMZ ? 
> 
>Cheers, 
>Alexandre 
>

Character references like the one above are handled by #nextCharReference. 
It reads the number after the "&#" or "&#x" prefix and then sends #value: to 
the class Unicode with that number as the argument. If you evaluate 
"(Unicode value: 8230) codePoint" in a workspace with cmd-p, you will see that 
the resulting code point is not what you would expect. For me it was 
"1069555750". The same behavior results when creating a Unicode character with 
#charFromUnicode:. Unless Unicode>>value: and Unicode>>charFromUnicode: are 
being used incorrectly, I am not sure that this is a bug, or at least a bug in 
XML-Support.
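
For what it is worth, the odd number can be taken apart with plain integer 
arithmetic. The snippet below is only a sketch: it assumes (I have not 
confirmed this against the Unicode or Character source) that the value carries 
some kind of tag in the bits above the low 22, with the actual code point kept 
in the low 22 bits. The variable names are mine.

```smalltalk
"Sketch only: assumes the value is (tag bitShift: 22) + codePoint.
 Uses integer arithmetic alone, so it does not depend on the Unicode class."
| observed lowBits tag |
observed := 1069555750.
lowBits := observed bitAnd: 16r3FFFFF.   "keep only the low 22 bits"
tag := observed bitShift: -22.           "whatever sits above those bits"
Transcript show: lowBits printString; cr.
Transcript show: tag printString; cr.
```

The low 22 bits come out as 8230, which is the expected ellipsis code point, 
and the remainder is 255. If the assumption holds, the character itself is 
right and the failing assertion is comparing against a tagged value rather 
than the bare code point.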

(I am working on adding full DTD support with validation and refactoring and 
re-engineering the parser at the moment, which is why minor releases have 
slowed to a trickle. I will take a closer look at how character encoding is 
handled in the process.)


_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
