The handling of encoding is not coherent in the extension, as my last
patch on the topic illustrates. While I have no doubt that there are
issues to resolve, in this particular instance I do not get the
result you do.
Anyone wanting to look at the way encoding is handled is welcome to
make a recommendation.
Dan
On Nov 27, 2007, at 11:41, Paul Dlug wrote:
There is a serious inconsistency when "round tripping" XML containing
UTF-8 characters. If you output the whole document to a string after
parsing, you get the UTF-8 back out. If you just grab a node and
convert it to a string, the UTF-8 characters are replaced with entities:
utf8test.rb:
require 'xml/libxml'
xml = <<XML
<?xml version="1.0" encoding="UTF-8"?>
<title>This is a UTF-8 pi: π</title>
XML
parser = XML::Parser.new
parser.string = xml
doc = parser.parse
puts doc.to_s
puts doc.root.to_s
This outputs:
<?xml version="1.0" encoding="UTF-8"?>
<title>This is a UTF-8 pi: π</title>
<title>This is a UTF-8 pi: &#x3C0;</title>
I would think that the behavior of to_s by default would be to write
the XML out as a string just as it was parsed. Another variant should
be provided if character conversion is desirable.
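To make the distinction concrete, here is a minimal sketch in plain Ruby,
independent of libxml. The helper names escape_non_ascii and
unescape_char_refs are hypothetical and not part of the extension; they only
illustrate what an explicit conversion variant might look like next to a to_s
that leaves UTF-8 untouched. It assumes Ruby 1.9 or later for String#ord.

```ruby
# Hypothetical helpers for illustration only; not part of libxml-ruby.

# Replace every non-ASCII character with a hexadecimal character reference.
def escape_non_ascii(str)
  str.gsub(/[^[:ascii:]]/) { |ch| format('&#x%X;', ch.ord) }
end

# Turn hexadecimal character references back into UTF-8 characters.
def unescape_char_refs(str)
  str.gsub(/&#x([0-9A-Fa-f]+);/) { [Regexp.last_match(1).hex].pack('U') }
end

title = "<title>This is a UTF-8 pi: \u03C0</title>"

escaped = escape_non_ascii(title)
puts escaped                              # entity form, safe for ASCII-only output
puts unescape_char_refs(escaped) == title # the conversion is reversible
```

Keeping the conversion in a separately named method means the default
string output can stay byte-for-byte faithful to what was parsed.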
--Paul
_______________________________________________
libxml-devel mailing list
libxml-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/libxml-devel