Encoding is not handled coherently in the extension, as my last patch on the topic illustrates. While I have no doubt that there are issues to resolve, in this particular instance I do not get the same result you do.

Anyone wanting to look at the way encoding is handled is welcome to make a recommendation.

Dan

On Nov 27, 2007, at 11:41, Paul Dlug wrote:

There is a serious inconsistency when "round tripping" XML containing
UTF-8 characters. If you output the document to a string after parsing,
you get the UTF-8 back out. If you just grab a node and convert it to a
string, the UTF-8 characters are replaced with numeric entities:

utf8test.rb:

require 'xml/libxml'

xml = <<XML
<?xml version="1.0" encoding="UTF-8"?>
<title>This is a UTF-8 pi: π</title>
XML

parser = XML::Parser.new
parser.string = xml

doc = parser.parse

puts doc.to_s       # whole document: π comes back as literal UTF-8
puts doc.root.to_s  # root node only: π comes back as &#x3C0;


This outputs:

<?xml version="1.0" encoding="UTF-8"?>
<title>This is a UTF-8 pi: π</title>
<title>This is a UTF-8 pi: &#x3C0;</title>


I would think that the behavior of to_s by default would be to write
the XML out as a string just as it was parsed. Another variant should
be provided if character conversion is desirable.
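
Until the two to_s paths agree, the node output can be post-processed. The two forms are equivalent to an XML parser, since &#x3C0; is the numeric character reference for U+03C0 (π). What follows is only a sketch of a workaround in plain Ruby, not a fix inside the extension: decode_char_refs is a hypothetical helper name, not part of libxml-ruby, and it assumes the string is UTF-8 and that every reference names a valid code point.

```ruby
# Hypothetical helper (not part of libxml-ruby): turn numeric character
# references for non-ASCII code points back into literal UTF-8 characters.
def decode_char_refs(xml)
  xml.gsub(/&#(x[0-9A-Fa-f]+|\d+);/) do |ref|
    digits = Regexp.last_match(1)
    # "x3C0" is hexadecimal, "960" is decimal.
    code = digits.start_with?('x') ? digits[1..-1].hex : digits.to_i
    # Keep ASCII references such as &#x3C; ("<") so the markup stays well-formed.
    code > 0x7F ? [code].pack('U') : ref
  end
end

node_string = '<title>This is a UTF-8 pi: &#x3C0;</title>'
puts decode_char_refs(node_string)
# prints: <title>This is a UTF-8 pi: π</title>
```

With the script above, this would be called as decode_char_refs(doc.root.to_s). References below U+0080 are left untouched on purpose, because expanding something like &#x26; would put a bare "&" into the markup.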


--Paul
_______________________________________________
libxml-devel mailing list
libxml-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/libxml-devel
