Re: HTML::Parser modifies unicode characters

Dominic Mitchell Mon, 13 Sep 2004 03:56:16 -0700

Moshe Kaminsky wrote:

* Dominic Mitchell <[EMAIL PROTECTED]> [13/09/04 12:05]:
Moshe Kaminsky wrote:
* Dominic Mitchell <[EMAIL PROTECTED]> [12/09/04 01:53]:
Moshe Kaminsky wrote:
It appears that HTML::Parser modifies some unicode characters while parsing. The following program gives an example:
#########
#!/usr/bin/perl
use HTML::Parser;
use utf8;
open TEST, '>:utf8', 'word.txt';
my $p = new HTML::Parser text_h => [sub {print TEST shift}, 'text'];
$p->parse("zespołów\n");
close TEST;
#########
After running it, 'word.txt' contains "zespołów" with the funny l and the funny o following it transformed to something else. What am I doing wrong? I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.
It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a nasty tendency to do this. :(

Thankfully the workaround is fairly simple. Add "use Encode" to the top of the script, and change the callback slightly:
sub { print TEST decode_utf8(shift) }
seems to work ok here.
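
A minimal sketch of what is probably going on, using only core modules rather than HTML::Parser itself. It assumes the handler receives the UTF-8 bytes of the text without the UTF-8 flag (simulated here with encode_utf8); the helper name written_via_utf8_layer is made up for the illustration. Printing such bytes through a ':utf8' layer encodes each high byte a second time, while decoding first gives the layer real characters to encode:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # the literal below is UTF-8 source
use Encode qw(encode_utf8 decode_utf8);

my $word  = "zespołów";
my $bytes = encode_utf8($word);            # stand-in for text that lost its flag

# Hypothetical helper: print a string to an in-memory file through the
# ':utf8' layer and return the raw bytes that ended up in the buffer.
sub written_via_utf8_layer {
    my ($text) = @_;
    my $buf = '';
    open my $fh, '>:utf8', \$buf or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close $fh;
    return $buf;
}

my $mangled  = written_via_utf8_layer($bytes);               # encoded twice
my $repaired = written_via_utf8_layer(decode_utf8($bytes));  # encoded once

printf "without decode_utf8: %d bytes written\n", length $mangled;
printf "with decode_utf8:    %d bytes written\n", length $repaired;
```

If the two counts differ, the undecoded version has been double-encoded on output, which would match the symptom in 'word.txt'.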
Thanks! That actually works. However, my real situation is that I'm using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and HTML::Parser. So to fix the problem, it appears that the only way is to modify HTML::TreeBuilder? I wonder if the maintainers of HTML::Parser are aware of this problem, and if so, why don't they do this automatically (or at least add an option to do it automatically) before giving the text to the handler?
Hmmm, it's a known problem:
http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS
Thanks. I must say, though, that the explanation there is quite vague. I don't see myself deducing your solution from this statement.

It's more just guesswork, based on experience with Perl's Unicode. Most problems come down to something or other losing the UTF-8 flag on a scalar and are solved with the Encode module. Encode::is_utf8() is a handy tool for checking whether this is happening.
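
A rough sketch of that kind of check, using only core modules. The helper name describe is made up for the example, and the "lost flag" case is simulated with encode_utf8 rather than produced by HTML::Parser:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # the literal below is UTF-8 source
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: report a scalar's length and UTF-8 flag state.
sub describe {
    my ($label, $str) = @_;
    return sprintf "%-8s length=%d utf8_flag=%s",
        $label, length($str), Encode::is_utf8($str) ? 'on' : 'off';
}

my $chars = "zespołów";                    # character string
my $bytes = encode_utf8($chars);           # same text as undecoded bytes

print describe('chars',   $chars), "\n";
print describe('bytes',   $bytes), "\n";
print describe('decoded', decode_utf8($bytes)), "\n";
```

A scalar that reports a longer length with the flag off than the text should have is a good hint that it holds undecoded UTF-8.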

It doesn't look unsolvable, but it's slightly beyond my XS skills. The key is indicating the character encoding of what you're parsing, but that's sometimes difficult to determine in advance (think HTML meta tags).
I know nothing about XS, unfortunately, but the way I imagine it is that at some point, HTML::Parser calls the method given by text_h, passing the text to it. So instead of just passing the text, I suggest that it should pass decode_utf8 applied to the text. Alternatively, call a fixed (ordinary Perl) sub 'foo', giving it the value of text_h and the text, and foo will apply decode_utf8 to the text and then pass the result to text_h.
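
The wrapper idea could be sketched in plain Perl roughly as follows. This is only an illustration: with_utf8_decoding is a made-up name, nothing here is part of HTML::Parser, and it simply assumes the text handed over is always UTF-8. The parser is simulated by calling the wrapped handler directly with undecoded bytes:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # the literal below is UTF-8 source
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical wrapper: returns a new handler that decodes its first
# argument as UTF-8 before passing it (and any other arguments) on to
# the original handler.
sub with_utf8_decoding {
    my ($handler) = @_;
    return sub {
        my ($text, @rest) = @_;
        return $handler->(decode_utf8($text), @rest);
    };
}

my @seen;
my $text_handler = with_utf8_decoding(sub { push @seen, shift });

# Simulate a parser handing over UTF-8 bytes without the flag.
$text_handler->(encode_utf8("zespołów"));

printf "handler received %d characters\n", length $seen[0];
```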

The trouble is that there's no guarantee that in the general case, the input will always be UTF-8. At some point in all this, the input character encoding needs to be specified. Only from that can the appropriate action be taken.
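
To illustrate why the encoding has to be known, here is a small sketch using only the core Encode module. The helper name to_characters is invented for the example, and the charset names are assumed to have been found some other way (HTTP header, meta tag, configuration). The same word arrives as different bytes depending on the encoding, and only decoding with the right name recovers it:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # the literal below is UTF-8 source
use Encode qw(encode decode);

# Hypothetical helper: turn raw input bytes into a Perl character string,
# given the charset name; dies if the bytes are invalid for that charset.
sub to_characters {
    my ($bytes, $charset) = @_;
    return decode($charset, $bytes, Encode::FB_CROAK);
}

my $word      = "zespołów";
my $as_utf8   = encode('UTF-8',      $word);   # one possible input
my $as_latin2 = encode('iso-8859-2', $word);   # another possible input

my $from_utf8   = to_characters($as_utf8,   'UTF-8');
my $from_latin2 = to_characters($as_latin2, 'iso-8859-2');

printf "input sizes: %d and %d bytes\n", length $as_utf8, length $as_latin2;
print "both inputs decode to the same text\n" if $from_utf8 eq $from_latin2;
```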

As to how to fix it via HTML::FormatText, I'm not sure. You'd need to read through the code to find out what it's doing and fix at an appropriate point.
I did it. It is in fact in HTML::TreeBuilder. The thing is that I'm giving this code to people, so now I need to tell people to make this change as well (and they might not have the right permissions, might not know Perl, may have a different version of HTML::TreeBuilder ...)

But perhaps there is another way. Instead of writing out to a file, can you write to an in-memory string? If so, then that string would hold UTF-8 bytes without the UTF-8 flag set. So you could fix that by doing "decode_utf8()" over that string before writing it to a file. Or simply write that file out without any encoding layer, which would do no transformation of the UTF-8 bytes.
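
Both options might look something like this. It is a sketch under assumptions: the handler output is simulated by a list of undecoded UTF-8 chunks, and temporary files from the core File::Temp module stand in for the real output file:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # the literals below are UTF-8 source
use Encode qw(encode_utf8 decode_utf8);
use File::Temp qw(tempfile);

# Stand-in for handler output: chunks of UTF-8 bytes without the flag.
my @parts  = ("zespo", "łów", "\n");
my @chunks = map { encode_utf8($_) } @parts;

# Collect everything in an in-memory string first.
my $collected = '';
$collected .= $_ for @chunks;

# Option 1: decode once, then write through a ':utf8' layer.
my ($fh1, $file1) = tempfile(UNLINK => 1);
binmode $fh1, ':utf8';
print {$fh1} decode_utf8($collected);
close $fh1;

# Option 2: write the bytes as they are, with no encoding layer.
my ($fh2, $file2) = tempfile(UNLINK => 1);
binmode $fh2, ':raw';
print {$fh2} $collected;
close $fh2;

printf "file sizes: %d and %d bytes\n", -s $file1, -s $file2;
```

If both files come out the same size, the two routes produced the same bytes on disk.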
In the real-life case I'm not writing to a file at all; I just did that in the example to make it easy to verify. But the usage is hidden inside HTML::FormatText, which gives me a text formatting of the whole HTML page. And if I try to use decode_utf8 on this result, I get other gibberish (presumably because that result is already a Perl string).


-Dom

--
| Semantico: creators of major online resources          |
|       URL: http://www.semantico.com/                   |
|       Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232  |
|   Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |
