Hello, I am writing Perl scripts that go to webpages, download certain content, and then create a CSV file with the relevant data. I am trying to be a friendly web robot, so I am using the LWP::RobotUA module.
    use LWP::RobotUA;

    my $ua = LWP::RobotUA->new('product_name', 'my_email');
    $ua->delay(1/60);   # delay is given in minutes, so this is at most one hit per second
    $ua->timeout(40);
    $ua->env_proxy;

    sub SC_Recursive {
        my $url = shift;
        my $response = $ua->get($url);
        unless ($response->is_success) {
            die "Bad link: $url\n";
        }
        my $page = $response->content;
        # ... (rest of the subroutine omitted)
    }

The problem I am having is that a number of these webpages contain special multibyte characters, such as the trademark symbol and the registered trademark symbol. For example, in the CSV, the trademark (TM) symbol shows up like this:

    â„¢

That is fine in a way, because if I redisplay the data on a webpage with <meta charset='utf-8'>, Firefox and Internet Explorer display the characters as intended. However, I want to convert those special characters to Unicode or some other representation so that they work better with other processes I don't control.

The solution I have come up with is to match the raw bytes of these characters (by their hex values) in regular expressions and replace each one with the HTML representation of the appropriate Unicode value:

    s/\xe2\x84\xa2/&#8482;/gs;   # trademark sign
    s/\xc2\xa0/&#160;/gs;        # no-break space
    s/\xc2\xae/&#174;/gs;        # registered sign
    s/\xe2\x80\x99/&#8217;/gs;   # right single quotation mark
    # etc...

My question is whether there is already an easier, more universal way to do this, such as a module method written for this purpose. I have tried to use CPAN modules, particularly Encode, to do this for me. So far I have not found an appropriate method, but I am no expert on these modules, so perhaps I missed one or did not correctly understand its use.

Thanks for your help, and please let me know if I need to include additional information.

John
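
P.S. In case it helps anyone reproduce this, here is a minimal self-contained sketch of the byte-replacement approach described above. The hash %entity_for, the sub replace_known_bytes, and the sample string are names made up for illustration and are not part of my actual script. It assumes the page content arrives as raw UTF-8 bytes and that numeric HTML entities are the desired output.

```perl
use strict;
use warnings;

# Hypothetical lookup table: UTF-8 byte sequence => numeric HTML entity.
# Only the four sequences mentioned in this post are listed.
my %entity_for = (
    "\xe2\x84\xa2" => '&#8482;',   # trademark sign, U+2122
    "\xc2\xa0"     => '&#160;',    # no-break space, U+00A0
    "\xc2\xae"     => '&#174;',    # registered sign, U+00AE
    "\xe2\x80\x99" => '&#8217;',   # right single quotation mark, U+2019
);

# Build one alternation from the table, longest byte sequences first.
my $pattern = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) }
    keys %entity_for;

# Replace every listed byte sequence; anything not in the table is left alone.
sub replace_known_bytes {
    my ($bytes) = @_;
    $bytes =~ s/($pattern)/$entity_for{$1}/g;
    return $bytes;
}

# Illustrative input only (not taken from a real page).
my $sample = "Widget\xe2\x84\xa2 by Acme\xc2\xae";
print replace_known_bytes($sample), "\n";
```

The weakness is the same as in my regex list: every character has to be added to the table by hand, which is why I am hoping a module already covers the general case.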