There is a serious inconsistency when "round tripping" XML that contains
UTF-8 characters. If you output the whole document to a string after
parsing, you get the UTF-8 characters back. If you instead grab a node and
convert it to a string, the UTF-8 characters are replaced with numeric
entities:

utf8test.rb:

require 'xml/libxml'

xml = <<XML
<?xml version="1.0" encoding="UTF-8"?>
<title>This is a UTF-8 pi: π</title>
XML

parser = XML::Parser.new
parser.string = xml

doc = parser.parse

puts doc.to_s
puts doc.root.to_s


This outputs:

<?xml version="1.0" encoding="UTF-8"?>
<title>This is a UTF-8 pi: π</title>
<title>This is a UTF-8 pi: &#x3C0;</title>


I would expect to_s, by default, to write the XML out as a string
exactly as it was parsed. A separate variant should be provided for
cases where character conversion is desirable.
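
In the meantime, one possible workaround is to post-process the string
returned for a node and turn the numeric entities back into UTF-8. The
following is only a rough sketch: decode_numeric_entities is a
hypothetical helper of my own, not part of libxml, and it assumes every
reference in the input names a valid code point. It deliberately leaves
ASCII references (such as an escaped ampersand) untouched so the markup
stays well-formed.

```ruby
# Hypothetical helper, not part of libxml-ruby: converts numeric character
# references for non-ASCII code points back into literal UTF-8 characters.
# References to ASCII code points (e.g. &#x26; for "&" or &#60; for "<")
# are left as they are, so the result is still well-formed XML.
# Assumes every reference names a valid Unicode code point.
def decode_numeric_entities(xml_string)
  xml_string.gsub(/&#(x[0-9A-Fa-f]+|[0-9]+);/) do |ref|
    digits = Regexp.last_match(1)
    # A leading "x" marks a hexadecimal reference; otherwise it is decimal.
    code_point = digits.start_with?('x') ? digits[1..-1].hex : digits.to_i
    code_point > 127 ? [code_point].pack('U') : ref
  end
end

# The string below is the node output shown earlier in this message.
node_output = '<title>This is a UTF-8 pi: &#x3C0;</title>'
puts decode_numeric_entities(node_output)
```

With the helper in place, the call would be
decode_numeric_entities(doc.root.to_s) rather than a literal string.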


--Paul
