Fellow Perlers,

I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that appears to mangle an originating Flickr feed. The curious thing is that when I pull the offending string out of the RSS and just stick it in a script, Encode knows how to decode it properly, while XML::LibXML (and my Unicode-aware editors) cannot.
The attached script demonstrates. $str has the bogus-looking character. Encode, however, seems to convert it properly to the "č" in "Laurinavičius" in the output. XML::LibXML, OTOH, outputs it as "LaurinaviÄius" -- that is, broken. (If things look truly borked in this email too, please look at the attached script.)

So my question is, what gives? Is this truly a broken representation of the character, and Encode just figures that out and fixes it? Or is there something off with my editor and with XML::LibXML?

FWIW, the character looks correct in my editor when I load it from the original Flickr feed. It's only after processing by Yahoo! Pipes that it comes out looking mangled.

Any insights would be appreciated.

Best,

David
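#!/usr/local/bin/perl -w

use strict;
use Encode;
use XML::LibXML;

my $parser = XML::LibXML->new({
    no_network => 1,
    encoding   => 'utf-8',
});

# The offending string, pasted as-is from the Yahoo! Pipes output.
my $str = '<p>Tomas LaurinaviÃÂius</p>';
print $str, $/;

# Decode a copy with Encode; a CHECK value of 1 (FB_CROAK) dies on malformed input.
my $copy = $str;
my $utf8 = decode('utf-8', $copy, 1);
print $utf8, $/;

# Parse the same string with XML::LibXML and print the serialized root element.
my $doc = $parser->parse_html_string($str, encoding => 'utf-8');
print $doc->documentElement->toString, $/;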
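
One hypothesis that can be checked without the feed at all is that the string has been UTF-8 encoded twice: the two bytes of "č" (U+010D) get misread as two Latin-1 characters and are then encoded again. This is only a guess on my part about what Yahoo! Pipes might be doing, not something I have confirmed. A minimal sketch of that round trip, using nothing but Encode (the variable names are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Start from the character as it should be: U+010D (c with caron).
my $char = "\x{010D}";

# Correct UTF-8 encoding of that single character.
my $once = encode('UTF-8', $char);

# Simulate the suspected mangling (an assumption, not a confirmed cause):
# read those bytes as Latin-1 characters, then encode as UTF-8 again.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Show the raw bytes of each form in hex.
printf "once:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $once;
printf "twice: %s\n", join ' ', map { sprintf '%02X', ord } split //, $twice;

# A single decode of the double-encoded bytes yields two characters ...
my $step1 = decode('UTF-8', $twice);
printf "decoded once:  %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $step1;

# ... and only undoing the Latin-1 step plus a second decode recovers U+010D.
my $step2 = decode('UTF-8', encode('ISO-8859-1', $step1));
printf "decoded twice: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $step2;
```

If the string really is double-encoded, a single decode would leave U+00C4 ("Ä") followed by the invisible control character U+008D. Printed to a filehandle with no encoding layer, those two characters go out as the bytes C4 8D, which a UTF-8 terminal would render as "č". That might explain why the Encode output only looks fixed.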