On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:

>> So it may be valid UTF-8, but why does it come out looking like crap? That
>> is, "LaurinaviÃ≥Ÿius"? I suppose there's an argument that
>> "LaurinaviÄŸius" is correct and valid, if ugly. Maybe?
>
> I am unsure if this is the explanation you are looking for but here goes:
>
> I think the original data contained the character \x{010d}. In utf-8, that
> means that it should be represented as the bytes \x{c4} and \x{8d}. If those
> bytes are not marked as in fact being a two-byte utf-8 encoding of a single
> character, or if an application reading the data mistakenly thinks it is not
> encoded (both common errors), somewhere along the transmission an application
> may decide that it needs to re-encode the characters in utf-8.
>
> So the original character \x{010d} is represented by the bytes \x{c4} and
> \x{8d}, an application thinks those are in fact characters and encodes them
> again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. Which I believe
> is your broken data.

I see. That makes sense. FYI, the original source is at:

  http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22

Look for "Tomas" in the output. If it doesn't show up, change max=50 to
max=75 or something.

> I think the error comes from Perl's handling of utf-8 data and that this
> handling has changed in subtle ways all the way since Perl 5.6. We have
> supported utf-8 in our applications since Perl 5.6 and have experienced this
> repeatedly. Any major upgrade of Perl or indeed the much needed upgrade of
> DBD::ODBC Martin Evans provided have given us a lot of work trying to sort
> out these troubles.

Maintaining backwards compatibility with the pre-utf8 days must make this
far more difficult than it otherwise would be.

> I wonder if your code would work fine in Perl 5.8? We are "only" at 5.10(.1)
> but the upgrade from 5.8 to 5.10 also gave us some utf-8 trouble. If it works
> fine in Perl 5.8 maybe the error is in an assumption somewhere in XML::LibXML?

In my application, I finally got XML::LibXML to choke on the invalid
characters, and then found that the problem was that I was running
Encode::CP1252::zap_cp1252 against the string before passing it to
XML::LibXML. Once I removed that, it stopped choking. So clearly zap_cp1252
was changing bytes it should not have. I now have it running fix_cp1252
*after* the parsing, when everything is already UTF-8.

Now that I think about it, though, I should probably change it so that it
searches on characters instead of bytes when working on a utf8 string. Will
have to look into that.

In the meantime, I'll just accept that sometimes the characters are valid
UTF-8 and look like shit. Frankly, when I run the above feed through
NetNewsWire, the offending byte sequence displays as "Ä", just as it does in
my app's output. So I blame Yahoo.

Thanks for the detailed explanation, Henning, much appreciated.

Best,

David
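
For reference, the double encoding Henning describes can be reproduced in a few lines. This is only a minimal sketch in Python, not the Perl code from this thread. The function names are made up for illustration, and it assumes the misbehaving application treated each UTF-8 byte as a Latin-1 character before encoding again:

```python
def to_utf8(text: str) -> bytes:
    """Correct, single UTF-8 encoding of a character string."""
    return text.encode("utf-8")


def double_encode(text: str) -> bytes:
    """Simulate the bug: the UTF-8 bytes are mistaken for Latin-1
    characters (one character per byte) and encoded to UTF-8 again."""
    misread = to_utf8(text).decode("latin-1")
    return misread.encode("utf-8")


def undo_double_encode(data: bytes) -> str:
    """Reverse the bug, assuming the data really was encoded twice."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


original = "\u010d"               # LATIN SMALL LETTER C WITH CARON
once = to_utf8(original)          # b'\xc4\x8d'
twice = double_encode(original)   # b'\xc3\x84\xc2\x8d'

# The doubled bytes are still valid UTF-8, but they decode to
# U+00C4 ("Ä") followed by the invisible control character U+008D.
shown = twice.decode("utf-8")

print(once, twice, repr(shown))
print(undo_double_encode(twice) == original)
```

The doubled sequence decodes without error, which would explain why a parser accepts it while a reader sees "Ä". The repair function is only safe on data known to be double-encoded. Applied to correctly encoded text, it either raises an error or corrupts the string.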