On 19.05.2010 02:17, jaayer wrote:

============ Forwarded message ============
 From : jaayer<[email protected]>
To :<[email protected]>
Date : Tue, 18 May 2010 16:30:06 -0700
Subject : Re: Decoding bug with XMLParser ?
============ Forwarded message ============

---- On Tue, 18 May 2010 02:29:18 -0700 Alexandre Bergel<[email protected]> wrote ----

To give a bit of context, the problem is:

-=-=-=-=-=-=-=-=-=-=-=-=
exampleEncodedXML
     ^'<?xml version="1.0" encoding="UTF-8"?>
<test-data>&#8230;</test-data>
'

testDecodingCharacters
     | xmlDocument element |
     "XMLTokenizer testDecodingCharacters"

     xmlDocument := XMLDOMParser parseDocumentFrom: self exampleEncodedXML readStream.
     element := xmlDocument firstTagNamed: #'test-data'.
     self assert: element contentString first codePoint = 8230
-=-=-=-=-=-=-=-=-=-=-=-=

#testDecodingCharacters goes yellow

Thinking of it, it's not really an encoding problem, rather a bug in
the entity->character conversion. I guess there should be a similar
test where there is an actual ellipsis character in the xml, instead
of the entity.
Any idea how your test can go green?

And now I realize our server will not be able to connect outside its
DMZ, so I won't be able to use the fix :D
DMZ ?

Cheers,
Alexandre

Character references like the one above are handled by #nextCharReference. It reads the number after the "&#" 
or "&#x" prefix and then sends #value: to the class Unicode with that number as the argument. If you evaluate 
"(Unicode value: 8230) codePoint" in a workspace with cmd-p, you will see that the resulting code point is not what you would 
expect. For me it was "1069555750". The same happens when creating a Unicode character with #charFromUnicode:. Unless 
Unicode>>value: and Unicode>>charFromUnicode: are being used incorrectly, I am not sure that this is a bug, or at least a bug in 
XML-Support.

(I am working on adding full DTD support with validation and refactoring and 
re-engineering the parser at the moment, which is why minor releases have 
slowed to a trickle. I will take a closer look at how character encoding is 
handled in the process.)

codePoint returns the raw value, which includes the leadingChar used to differentiate between different locale interpretations of the same character.
In 1.0 the leadingChar was 255 for WideCharacters; in 1.1 it has been changed to 0.
That is, using codePoint in the test is erroneous. For a method that returns what you expect in both 1.0 and 1.1, use asUnicode.
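
As a rough check of the explanation above, the reported value 1069555750 is what you get when a leadingChar of 255 sits above the low 22 bits of the code point 8230. The 22-bit shift is an assumption inferred from that arithmetic, not something stated in this thread. The first three expressions are plain integer arithmetic for a workspace. The last line is a fragment of the test method, with asUnicode swapped in for codePoint as suggested, and is a sketch rather than a verified fix.

```smalltalk
"Workspace arithmetic only; no XML classes needed."
(255 bitShift: 22) + 8230.                "1069555750, the value reported above"
1069555750 bitShift: -22.                 "255, the 1.0 leadingChar"
1069555750 bitAnd: (1 bitShift: 22) - 1.  "8230, the code point of the ellipsis"

"Last line of testDecodingCharacters, using asUnicode instead of codePoint."
self assert: element contentString first asUnicode = 8230
```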

Cheers,
Henry


_______________________________________________
Pharo-project mailing list
[email protected]
http://lists.gforge.inria.fr/cgi-bin/mailman/listinfo/pharo-project
