On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:

>> So it may be valid UTF-8, but why does it come out looking like crap? That 
>> is, "LaurinaviÃ≥Ÿius"? I suppose there's an argument that 
>> "LaurinaviÄŸius" is correct and valid, if ugly. Maybe?
> 
> I am unsure if this is the explanation you are looking for but here goes:
> 
> I think the original data contained the character \x{010d}. In utf-8, that 
> means that it should be represented as the bytes \x{c4} and \x{8d}. If those 
> bytes are not marked as in fact being a two-byte utf-8 encoding of a single 
> character, or if an application reading the data mistakenly thinks it is not 
> encoded (both common errors), somewhere along the transmission an application 
> may decide that it needs to re-encode the characters in utf-8. 
> 
> So the original character \x{010d} is represented by the bytes \x{c4} and 
> \x{8d}, an application thinks those are in fact characters and encodes them 
> again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. Which I believe 
> is your broken data.
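
To spell that out, here is a minimal sketch of the double encoding described 
above, using the core Encode module. The assumption (mine, not confirmed 
anywhere in the feed) is that the second application treats the bytes as 
ISO-8859-1; hex_bytes is just a helper for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The original character: U+010D (LATIN SMALL LETTER C WITH CARON).
my $char = "\x{010d}";

# Correct encoding: one character becomes the two bytes C4 8D.
my $once = encode('UTF-8', $char);

# The mistake (assumed): treat those two bytes as two Latin-1 characters,
# U+00C4 and U+008D, and encode them as UTF-8 a second time.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Helper for this example: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02x', ord } split //, shift }

print hex_bytes($once),  "\n";   # c4 8d
print hex_bytes($twice), "\n";   # c3 84 c2 8d

# Peeling off the extra layer recovers the original character.
my $repaired = decode('UTF-8', encode('ISO-8859-1', decode('UTF-8', $twice)));
print $repaired eq $char ? "repaired\n" : "still broken\n";
```

The second line of output matches the \x{c3} \x{84} \x{c2} \x{8d} sequence 
above. The repair step only works if the data was double-encoded exactly once 
and in exactly this way, so it is not a general fix.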

I see. That makes sense. FYI, the original source is at:

  
http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22

Look for "Tomas" in the output. If it doesn't show up, change max=50 to max=75 
or something.

> I think the error comes from Perl's handling of utf-8 data and that this 
> handling has changed in subtle ways all the way since Perl 5.6. We have 
> supported utf-8 in our applications since Perl 5.6 and have experienced this 
> repeatedly. Any major upgrade of Perl, or indeed the much-needed upgrade of 
> DBD::ODBC that Martin Evans provided, has given us a lot of work trying to 
> sort out these troubles.

Maintaining the backwards compatibility from the pre-utf8 days must make it far 
more difficult than it otherwise would be.

> I wonder if your code would work fine in Perl 5.8? We are "only" at 5.10(.1) 
> but the upgrade from 5.8 to 5.10 also gave us some utf-8 trouble. If it works 
> fine in Perl 5.8 maybe the error is in an assumption somewhere in XML::LibXML?

In my application, I finally got XML::LibXML to choke on the invalid 
characters, and then found that the problem was that I was running 
Encode::CP1252::zap_cp1252 against the string before passing it to XML::LibXML. 
Once I removed that, it stopped choking. So clearly zap_cp1252 was changing 
bytes it should not have. I now have it running fix_cp1252 *after* the parsing, 
when everything is already UTF-8. Now that I think about it, though, I should 
probably change it so that it searches on characters instead of bytes when 
working on a utf8 string. Will have to look into that.
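
Roughly what I mean by bytes versus characters, as a sketch only. naive_zap 
below is a hypothetical stand-in written for this message, not the actual 
zap_cp1252 code: it just replaces anything in the 0x80-0x9F range (where 
CP1252 keeps its extra characters) with "?".

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical byte-oriented cleanup: replace code points 0x80-0x9F with "?".
sub naive_zap {
    my ($str) = @_;
    $str =~ s/[\x80-\x9f]/?/g;
    return $str;
}

my $name  = "Laurinavi\x{010d}ius";    # decoded character string
my $bytes = encode('UTF-8', $name);    # UTF-8 bytes; U+010D becomes C4 8D

# On bytes, the continuation byte 8D is in range and gets clobbered,
# so the result is no longer valid UTF-8 and a strict decode croaks.
my $zapped_bytes = naive_zap($bytes);
my $copy         = $zapped_bytes;      # decode with a CHECK may modify its input
my $still_valid  = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };

# On characters, U+010D is outside the range and survives untouched.
my $zapped_chars = naive_zap($name);

print $still_valid ? "bytes: still valid\n" : "bytes: broken UTF-8\n";
print $zapped_chars eq $name ? "chars: unchanged\n" : "chars: changed\n";
```

Run against the encoded bytes, the substitution breaks the string; run against 
the decoded characters, it leaves the name alone.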

In the meantime, I'll just accept that sometimes the characters are valid UTF-8 
and look like shit. Frankly, when I run the above feed through NetNewsWire, the 
offending byte sequence displays as "Ä", just as it does in my app's output. So 
I blame Yahoo.

Thanks for the detailed explanation, Henning, much appreciated.

Best,

David
