LWP getting special (multibyte) characters from webpages

John Refior Fri, 26 Dec 2008 12:03:46 -0800

Hello,

I am writing Perl scripts that go to webpages, download certain content,
and then create a CSV file with the relevant data.  I am trying to be a
friendly web robot, so I am using the LWP::RobotUA module.


   use LWP::RobotUA;

   my $ua = LWP::RobotUA->new('product_name', 'my_email');
   $ua->delay(1/60);  # delay is given in minutes: at most one hit per second
   $ua->timeout(40);
   $ua->env_proxy;

   sub SC_Recursive {
      my $url      = shift;
      my $response = $ua->get($url);
      unless ($response->is_success) { die "Bad link: $url\n"; }
      my $page     = $response->content;  # raw bytes as sent by the server
      # ... (rest of the sub omitted)
   }

The problem I am having is that a number of these webpages have special
multibyte characters on them, such as the trademark symbol and registered
trademark symbol.  For example, in the CSV, the trademark (TM) symbol
shows up like

   â„¢
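
For reference, the garbling above can be reproduced with the core Encode
module. This is only a sketch, and it assumes the program showing the CSV
reads the file as Windows-1252; the post does not say which program or
encoding was actually involved. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+2122 TRADE MARK SIGN as a Perl character string.
my $tm = "\x{2122}";

# Its UTF-8 encoding is three octets.
my $bytes = encode('UTF-8', $tm);
my $hex   = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

# Assumption: the viewer misreads those octets as Windows-1252, so each
# octet becomes a separate character.
my $misread    = decode('cp1252', $bytes);
my $codepoints = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $misread;

print "$hex\n";         # E2 84 A2
print "$codepoints\n";  # U+00E2 U+201E U+00A2
```

Those three code points are a-circumflex, a low double quotation mark, and
the cent sign, which is the three-character sequence shown above.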

Now that's fine in a way, because if I redisplay them on a webpage with
<meta charset='utf-8'>, Firefox and Internet Explorer display them as
intended.  However, I want to convert those special characters to an
ASCII-safe representation, such as HTML character references for their
Unicode code points, so that they work better with other processes I
don't control.  The solution I have come up with is to match the UTF-8
byte sequences of these characters in regular expressions and replace
each one with the HTML character reference for the corresponding Unicode
code point.

   s/\xe2\x84\xa2/&#x2122;/gs;
   s/\xc2\xa0/&#x00A0;/gs;
   s/\xc2\xae/&#x00AE;/gs;
   s/\xe2\x80\x99/&#x2019;/gs;
   # etc...
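
A more general form of the substitutions above might look like the
following sketch. It assumes the downloaded bytes really are UTF-8, which
is not verified here, and the helper name bytes_to_entities is
hypothetical. It uses only the core Encode module: the bytes are decoded
into characters once, and then every character outside ASCII is written
as a hexadecimal character reference.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take raw UTF-8 bytes and return ASCII-only text in
# which each non-ASCII character is written as &#xHHHH;.
sub bytes_to_entities {
    my ($bytes) = @_;

    # Decode the octets into Perl characters (assumes UTF-8 input).
    my $text = decode('UTF-8', $bytes);

    # Replace every character above U+007F with its hex character reference.
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord($1))/ge;

    return $text;
}

print bytes_to_entities("Acme\xe2\x84\xa2 Widget\xc2\xae\n");
# prints: Acme&#x2122; Widget&#x00AE;
```

Two existing pieces may cover the same ground and are worth checking
against their documentation: HTTP::Response has a decoded_content method
that returns characters rather than raw bytes, and HTML::Entities can
export encode_entities_numeric, which produces numeric character
references.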

My question is whether there is already an easier, more universal way to
do this, such as a module method already written for this purpose.  I have
tried to use CPAN modules, particularly Encode, to do this for me.  So far
I have not found an appropriate method, but I am no expert on these
modules, so perhaps I missed one or misunderstood how to use it.

Thanks for your help, and please let me know if I need to include
additional information.

John
