Re: Segfault using HTML::Entities

Richard Jolly Wed, 30 Jun 2004 14:25:56 -0700


On 30 Jun 2004, at 17:25, Eric Cholet wrote:

On 30 June 04, at 14:46, Richard Jolly wrote:
In my original mail the offending line was:
<title>The Modern R&amp;eacute;sum&amp;eacute;</title>
Now this is a bit off, because it is RSS, therefore utf8, but it's got encoded latin1 entities (&eacute;) in there, with the & further encoded for xml safety.
I'm no XML expert, but this doesn't look right. An e acute is &eacute;,
whereas &amp;eacute; is the literal text &eacute;. It's not "safer", it's different.
IMHO the double encoding is in the XML data itself.
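
To make the double-encoding point concrete, here is a minimal sketch. It uses Python's standard library as a stand-in (an assumption for illustration; the thread itself concerns Perl, where HTML::Entities' decode_entities plays the role of html.unescape). One decoding pass only strips the outer &amp; layer; a second pass is needed to reach the actual character.

```python
import html

# Title text from the feed, as quoted above.
raw = "The Modern R&amp;eacute;sum&amp;eacute;"

# First pass: &amp; becomes &, leaving the literal text "&eacute;".
once = html.unescape(raw)

# Second pass: the now-exposed &eacute; becomes the character itself.
twice = html.unescape(once)

print(once)
print(twice)
```

Decoding twice is only appropriate when the input is known to be double-encoded; applied blindly, it would corrupt text that legitimately contains a literal "&eacute;".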

Definitely the original RSS is messed up - it shouldn't need &eacute;, because it should be utf8. The script I wrote was an attempt to get the xml back to what the utf8 should be, and then html-encode it for web display (for legacy reasons I can't display it as utf8). The garbage I'm finding in RSS feeds is terrible; I just came across:

    <title>BCCI confirms India A&amp;Acirc;Â’s Zimbabwe tour</title>

in a supposedly utf8 feed (see my post on perlmonks at http://www.perlmonks.org/index.pl?node_id=370892 )
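
The second half of that pipeline (taking correctly decoded text and turning it into ASCII-only HTML for display) might look roughly like the sketch below. It is written in Python as a stand-in, not the script discussed in this thread; to_ascii_html is a hypothetical helper name, and in Perl the corresponding call would be HTML::Entities' encode_entities.

```python
import html
from html.entities import codepoint2name

def to_ascii_html(text):
    """Hypothetical helper: return ASCII-only HTML for already-decoded text."""
    # Escape the markup-significant ASCII characters (&, <, >) first,
    # so the ampersands added below are not escaped a second time.
    escaped = html.escape(text, quote=False)
    out = []
    for ch in escaped:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                            # plain ASCII passes through
        elif cp in codepoint2name:
            out.append("&%s;" % codepoint2name[cp])   # named entity, e.g. &eacute;
        else:
            out.append("&#%d;" % cp)                  # numeric reference as fallback
    return "".join(out)

print(to_ascii_html("The Modern R\u00e9sum\u00e9"))
```

This assumes the input has already been repaired into proper Unicode text; it does nothing for mis-decoded bytes like those in the title above.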

Also, saying &eacute; et al. are "latin1" entities doesn't make
sense to me, since entities are a way to encode non-ASCII characters
into an ASCII representation; this is orthogonal to the XML document's
encoding or the XML parser's output encoding.

True - I keep struggling to find the right terminology. I've moved to using 'html-encoding' to indicate using named entities such as &eacute;, but that's still not very good.

--
Eric Cholet
