Hi all,
I have run into a roadblock in trying to parse an RSS feed. Many
bloggers use so-called "smart quotes" in their posts. Smart quotes are
those curly single or double quotes. Ideally they should replace the
smart quotes with HTML encoded entities, but that appears not to
happen in many cases.
Where the smart quotes are represented as HTML encoded entities I am
"fixing" the quotes by replacing them with simple single or double
quotes with code similar to this:
$xml =~ s/&(amp;)?#8217;/'/g;
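The same idea can probably be extended to the other curly-quote entities. The following is only a sketch: the helper name fix_quote_entities and the %straight lookup table are made up for illustration, and it assumes the feed uses decimal numeric entities (8216, 8217, 8220, 8221), optionally with a double-escaped ampersand as in the line above.

```perl
use strict;
use warnings;

# Decimal entity numbers for the four curly quotes, mapped to ASCII quotes.
my %straight = (
    8216 => "'",    # left single quotation mark
    8217 => "'",    # right single quotation mark
    8220 => '"',    # left double quotation mark
    8221 => '"',    # right double quotation mark
);

# Hypothetical helper: replace curly-quote entities with straight quotes.
sub fix_quote_entities {
    my ($xml) = @_;
    # (?:amp;)? also catches the double-escaped form "&amp;#8217;".
    $xml =~ s/&(?:amp;)?#(8216|8217|8220|8221);/$straight{$1}/g;
    return $xml;
}

print fix_quote_entities('It&#8217;s &amp;#8220;quoted&amp;#8221;'), "\n";
```

Hexadecimal entities such as &#x2019; are not covered by this sketch.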
In those cases where they are not encoded, I want to read in an RSS
feed and replace every single curly quote with a straight quote (').
If I have done my research properly, the left-side single curly quote
can be generated in Perl by passing the value 145 to the chr()
function.
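For reference, here is a small sketch of what that number means. It assumes the value 145 comes from the Windows-1252 code page, where byte 0x91 is the left single quote; that origin is an assumption, not something confirmed above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# In Perl, chr(145) is the character with code point U+0091,
# which in Unicode is a C1 control character, not a quote.
printf "chr(145)    is U+%04X\n", ord(chr(145));

# Byte 0x91 interpreted as Windows-1252 (the assumed source of "145")
# decodes to the Unicode left single quotation mark.
my $left_single = decode('cp1252', "\x91");
printf "cp1252 0x91 is U+%04X\n", ord($left_single);
```

If the feed is UTF-8 rather than Windows-1252, the curly quotes would appear as U+2018, U+2019, U+201C and U+201D after decoding, not as 145 through 148.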
Here's the basic idea for what I want to do:
my $xml = get_xml_from_rss(); # this XML includes smart quotes
$xml =~ s/chr(145)/'/g;
Here's a sample script that pulls an RSS feed with this issue:
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://feeds.feedburner.com/ozphactor/entries/rss2');
my $xml;
if ($response->is_success) {
    $xml = $response->content;
}
else {
    die $response->status_line;
}
$xml =~ s/chr(145)/'/g;
$xml =~ s/chr(146)/'/g;
$xml =~ s/chr(147)/"/g;
$xml =~ s/chr(148)/"/g;
print STDERR $xml, "\n";
The script, of course, doesn't work. Can anyone help me fix this?
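One possible direction, offered as a sketch and not a confirmed fix: inside a regex, chr(145) is not called as a function. It is read as the literal text "chr" followed by a group matching "145". The sketch below uses escapes in a character class instead. It assumes the feed is UTF-8 (not verified for this feed), and the helper name straighten_quotes is made up for illustration. A hard-coded byte string stands in for $response->content so that the example runs without network access.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: expects an already-decoded character string.
sub straighten_quotes {
    my ($text) = @_;
    # U+2018/U+2019 and U+201C/U+201D are the Unicode curly quotes.
    # \x91-\x94 are the values 145-148, in case Windows-1252 bytes
    # slipped through without being decoded.
    $text =~ s/[\x{2018}\x{2019}\x91\x92]/'/g;
    $text =~ s/[\x{201C}\x{201D}\x93\x94]/"/g;
    return $text;
}

# Stand-in for the raw bytes of a UTF-8 feed (assumed encoding).
my $bytes = "It\xE2\x80\x99s \xE2\x80\x9Cquoted\xE2\x80\x9D";
my $chars = decode('UTF-8', $bytes);    # bytes -> characters
my $fixed = straighten_quotes($chars);
print encode('UTF-8', $fixed), "\n";    # prints: It's "quoted"
```

Decoding first matters because a UTF-8 curly quote is three bytes long, so a single-byte pattern would never match it in the raw content.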
Thanks,
Chris