Moshe Kaminsky wrote:

* Dominic Mitchell <[EMAIL PROTECTED]> [12/09/04 01:53]:

Moshe Kaminsky wrote:

It appears that HTML::Parser modifies some unicode characters while parsing. The following program gives an example:

#########

#!/usr/bin/perl
use HTML::Parser;
use utf8;
open TEST, '>:utf8', 'word.txt';
my $p = new HTML::Parser text_h => [sub {print TEST shift}, 'text'];
$p->parse("zespołów\n");
close TEST;

#########

After running it, 'word.txt' contains "zespoÅÃw" with the funny l and the funny o following it transformed to something else. What am I doing wrong?
I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.

It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a nasty tendency to do this. :(


Thankfully the workaround is fairly simple. Add "use Encode" to the top of the script, and change the callback slightly:

 sub { print TEST decode_utf8(shift) }

seems to work ok here.
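
To see why the decode helps, here is a minimal sketch using only the core Encode module. It does not call HTML::Parser at all; it simulates the lost flag by encoding the test word to octets, on the assumption that this is what the handler receives. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# The test word, written with escapes so the source encoding does not matter.
my $chars = "zespo\x{142}\x{f3}w";

# Simulate a string that lost its UTF-8 flag: same UTF-8 bytes,
# but Perl now treats every byte as a separate character.
my $bytes = encode_utf8($chars);

# Printing the flagless string through a :utf8 layer encodes each
# high byte a second time (an in-memory handle captures the result).
my $double = '';
open my $bad, '>:utf8', \$double or die "open: $!";
print {$bad} $bytes;
close $bad;

# Decoding first turns the bytes back into characters, so the layer
# encodes them exactly once.
my $fixed = '';
open my $good, '>:utf8', \$fixed or die "open: $!";
print {$good} decode_utf8($bytes);
close $good;

printf "chars=%d bytes=%d double=%d fixed=%d\n",
    length($chars), length($bytes), length($double), length($fixed);
```

The double-encoded output comes out longer than the correct one, which matches the mangled word in the file.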


Thanks! That actually works. However, my real situation is that I'm using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and HTML::Parser. So to fix the problem, it appears that the only way is to modify HTML::TreeBuilder? I wonder if the maintainers of HTML::Parser are aware of this problem, and if so, why don't they do this automatically (or at least add an option to do it automatically) before giving the text to the handler?

Hmmm, it's a known problem:

http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS

It doesn't look unsolvable, but it's slightly beyond my XS skills. The key is indicating the character encoding of what you're parsing, but that's sometimes difficult to determine in advance (think HTML meta tags).

As to how to fix it via HTML::FormatText, I'm not sure. You'd need to read through the code to find out what it's doing and fix it at an appropriate point. But perhaps there is another way. Instead of writing out to a file, can you write to an in-memory string? If so, that string would hold UTF-8 bytes without the UTF-8 flag set. You could fix that by calling decode_utf8() on the string before writing it to a file. Or simply write the file out without any encoding layer, which would leave the UTF-8 bytes untransformed.
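
A rough sketch of both options, assuming the formatter hands back UTF-8 octets with the flag off. The sub format_to_octets is a hypothetical stand-in for whatever HTML::FormatText call you end up using; I haven't checked what that module actually returns, so treat this as an illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);
use File::Temp qw(tempfile);

# Hypothetical stand-in for the formatter: returns UTF-8 octets
# with the UTF-8 flag off (an assumption, not the real module's API).
sub format_to_octets {
    return encode_utf8("zespo\x{142}\x{f3}w\n");
}

# Read a file back as raw bytes so the two outputs can be compared.
sub slurp_raw {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "open $path: $!";
    local $/;
    my $data = <$in>;
    close $in;
    return $data;
}

my $octets = format_to_octets();

# Option 1: decode once, then write through a :utf8 layer.
my ($fh1, $decoded_path) = tempfile(UNLINK => 1);
binmode $fh1, ':utf8';
print {$fh1} decode_utf8($octets);
close $fh1 or die "close: $!";

# Option 2: write the octets as they are, with no encoding layer.
my ($fh2, $raw_path) = tempfile(UNLINK => 1);
binmode $fh2;
print {$fh2} $octets;
close $fh2 or die "close: $!";

# Compare the two files byte for byte.
my $same = slurp_raw($decoded_path) eq slurp_raw($raw_path);
print $same ? "identical\n" : "different\n";
```

Either way the point is that the text gets encoded exactly once on its way to disk.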

-Dom

--
| Semantico: creators of major online resources          |
|       URL: http://www.semantico.com/                   |
|       Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232  |
|   Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |
