At 00:27 +0100 18/6/10, I wrote:
>If I save the file and undo the second decoding I get the proper output.
In this case, all talk of iso-8859-1 and cp1252 is a red herring. I read several Italian websites where the same problem shows up in external material such as ads. The news page itself is correctly encoded and declared as utf-8, but I imagine the web designers have reckoned that the material they receive from advertisers is most likely windows-1252 and convert it accordingly rather than bother to verify the encoding. As a result, material that actually arrives as utf-8 undergoes a superfluous second conversion and comes out garbled.
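To make the failure mode concrete, here is a small sketch of that superfluous conversion and how to reverse it. The string "caffè" is my own example, not from any of the sites in question; the point is only the round trip through Encode:

use strict;
use warnings;
use Encode qw(decode encode);

# A character string with one non-ASCII character (U+00E8).
my $text  = "caff\x{e8}";
my $bytes = encode("UTF-8", $text);          # correct UTF-8 bytes

# The designers' mistake: assume the bytes are windows-1252
# and convert them to utf-8 a second time.
my $mangled = encode("UTF-8", decode("cp1252", $bytes));

# The cure: undo the superfluous step, then decode once.
my $repaired = decode("UTF-8",
                      encode("cp1252", decode("UTF-8", $mangled)));

print $repaired eq $text ? "round trip ok\n" : "still mangled\n";

Run as-is, this prints "round trip ok": the mangled bytes spell out the familiar "caffÃ¨" mojibake, and reversing the cp1252 step recovers the original text.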
Here's a way to get the file in question properly encoded:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);
use Encode qw(find_encoding);

# Encode the output stream so printing decoded text does not
# trigger "wide character" warnings (instead of suppressing them
# with "no warnings").
binmode STDOUT, ':encoding(UTF-8)';

my $tempdir  = "/tmp";
my $tempfile = "tempfile";
my $f        = "$tempdir/$tempfile";

my $uri = "http://pipes.yahoo.com/pipes/pipe.run"
        . "?Size=Medium&_id=f53b7bed8b88412fab9715a995629722"
        . "&_render=rss&max=50&nsid=1025993%40N22";

# getstore returns an HTTP status code, so test it with
# is_success rather than treating the code itself as a boolean.
if (is_success(getstore($uri, $f))) {
    my $encoding = find_encoding("utf-8");
    open my $fh, '<', $f or die $!;
    while (<$fh>) {
        print $encoding->decode($_);
    }
    close $fh;
}
unlink $f;

JD