Re: HTML::Parser modifies unicode characters

Dominic Mitchell Sun, 12 Sep 2004 04:13:56 -0700

Moshe Kaminsky wrote:

It appears that HTML::Parser modifies some unicode characters while parsing. The following program gives an example:
#########
#!/usr/bin/perl
use HTML::Parser;
use utf8;
open TEST, '>:utf8', 'word.txt';
my $p = new HTML::Parser text_h => [sub {print TEST shift}, 'text'];
$p->parse("zespołów\n");
close TEST;
#########
After running it, 'word.txt' contains "zespoÅÃw" with the funny l and the funny o following it transformed to something else. What am I doing wrong? I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.

It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a nasty tendency to do this. :(
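To illustrate what a lost flag does on output, here is a minimal sketch that uses only core modules. It does not involve HTML::Parser at all: the flag-less string is simulated with Encode's encode_utf8, and the output goes to an in-memory filehandle instead of 'word.txt'. The variable names are made up for the illustration.

```perl
use strict;
use warnings;
use utf8;                                 # the literal below is UTF-8 source text
use Encode qw(encode_utf8);

my $chars = "zespołów";                   # character string, UTF-8 flag on
my $bytes = encode_utf8($chars);          # same data as octets, flag off

# Print the flag-less octets through a :utf8 layer, as the script above does.
my $out = '';
open my $fh, '>:utf8', \$out or die "open: $!";
print {$fh} $bytes;
close $fh;

# Every octet above 0x7F was taken for a Latin-1 character and encoded again,
# so the four octets of the two accented letters grow to eight.
printf "characters: %d, octets: %d, octets written: %d\n",
    length $chars, length $bytes, length $out;
```

The written data is longer than the correct UTF-8 encoding, which is the kind of double encoding that shows up as garbage in the output file.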

Thankfully the workaround is fairly simple. Add "use Encode" to the top of the script, and change the callback slightly:

  sub { print TEST decode_utf8(shift) }

That seems to work OK here.
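A minimal sketch of that callback in isolation, runnable without HTML::Parser: the text the handler receives is simulated with a flag-less byte string. The helper name to_chars and the utf8::is_utf8 guard are additions for illustration, not part of the fix above. The guard is there on the assumption that text may sometimes arrive already decoded, in which case decoding it a second time would be wrong.

```perl
use strict;
use warnings;
use utf8;                                 # the literal below is UTF-8 source text
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: decode only if the text is still flag-less octets.
sub to_chars {
    my ($text) = @_;
    return utf8::is_utf8($text) ? $text : decode_utf8($text);
}

my $word    = "zespołów";
my $from_xs = encode_utf8($word);         # stand-in for what the handler receives

my $out = '';
open my $fh, '>:utf8', \$out or die "open: $!";

# This closure is what would be passed as the text_h handler.
my $text_h = sub { print {$fh} to_chars(shift) };

$text_h->($from_xs);                      # octets: decoded before printing
$text_h->($word);                         # already characters: passed through
close $fh;
```

Both calls should write the same correctly encoded word. The guard is a heuristic: a Latin-1 character string that happens to have no flag would still be run through decode_utf8.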

-Dom
