* Dominic Mitchell <[EMAIL PROTECTED]> [13/09/04 12:05]:
> Moshe Kaminsky wrote:
>
> >* Dominic Mitchell <[EMAIL PROTECTED]> [12/09/04 01:53]:
> >
> >>Moshe Kaminsky wrote:
> >>
> >>>It appears that HTML::Parser modifies some unicode characters while
> >>>parsing. The following program gives an example:
> >>>
> >>>#########
> >>>
> >>>#!/usr/bin/perl
> >>>use HTML::Parser;
> >>>use utf8;
> >>>open TEST, '>:utf8', 'word.txt';
> >>>my $p = new HTML::Parser text_h => [sub {print TEST shift}, 'text'];
> >>>$p->parse("zespołów\n");
> >>>close TEST;
> >>>
> >>>#########
> >>>
> >>>After running it, 'word.txt' contains "zespołów" with the funny l and
> >>>the funny o following it transformed to something else. What am I doing
> >>>wrong?
> >>>I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.
> >>
> >>It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a
> >>nasty tendency to do this. :(
> >>
> >>Thankfully the workaround is fairly simple. Add "use Encode" to the top
> >>of the script, and change the callback slightly:
> >>
> >>  sub { print TEST decode_utf8(shift) }
> >>
> >>seems to work ok here.
> >
> >
> >Thanks! That actually works. However, my real situation is that I'm
> >using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and
> >HTML::Parser. So to fix the problem, it appears that the only way is to
> >modify HTML::TreeBuilder? I wonder if the maintainers of HTML::Parser
> >are aware of this problem, and if so, why don't they do this
> >automatically (or at least add an option to do it automatically) before
> >giving the text to the handler?
>
> Hmmm, it's a known problem:
>
> http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS

Thanks. I must say, though, that the explanation there is quite vague. I
don't see myself deducing your solution from this statement.

>
> It doesn't look unsolveable, but it's slightly beyond my XS skills.
> The key is indicating the character encoding of what you're parsing,
> but that's sometimes difficult to determine in advance (think HTML
> meta tags).

I know nothing about XS, unfortunately, but the way I imagine it is that
at some point HTML::Parser calls the sub given by text_h, passing the
text to it. So instead of just passing the text, I suggest that it should
pass decode_utf8 applied to the text. Alternatively, it could call a
fixed (ordinary perl) sub 'foo', giving it the value of text_h and the
text; foo would apply decode_utf8 to the text and then pass the result
to text_h.

>
> As to how to fix it via HTML::FormatText, I'm not sure. You'd need to
> read through the code to find out what it's doing and fix at an
> appropriate point.

I did it. The fix is in fact in HTML::TreeBuilder. The thing is that I'm
giving this code to other people, so now I need to tell them to make this
change as well (and they might not have the right permissions, might not
know perl, may have a different version of HTML::TreeBuilder ...)

> But perhaps there is another way. Instead of writing out to a file,
> can you write to an in-memory string? If so, then that string would be
> in UTF-8-without-the-UTF-8 flag set. So you could fix that by doing
> "decode_utf8()" over that string before writing it to a file. Or
> simply write that file out without any encoding which would do no
> transformation of the UTF-8 bytes.

In the real-life case I'm not writing to a file at all; I only did that
in the example to make it easy to verify. The actual usage is hidden
inside HTML::FormatText, which gives me a text formatting of the whole
html page. And if I try to use decode_utf8 on this result, I get other
gibberish (presumably because that result already is a perl string).

Thanks for the help.
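
The 'foo' idea above (an ordinary Perl sub sitting between the parser and
the text_h handler) could be sketched roughly as follows. This is only a
sketch under assumptions: the name decoding_handler is made up for
illustration, the byte string merely stands in for what the parser hands
to the callback, and the utf8::is_utf8 check is a heuristic added to
avoid decoding text that is already a Perl character string.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper (the 'foo' of the message): wraps a text handler so
# that it receives decoded characters instead of raw UTF-8 bytes.
sub decoding_handler {
    my ($handler) = @_;
    return sub {
        my ($text, @rest) = @_;
        # Heuristic: decode only when the UTF-8 flag is off, so a string
        # that is already characters is passed through untouched.
        $text = decode_utf8($text) unless utf8::is_utf8($text);
        return $handler->($text, @rest);
    };
}

# Assumption: stand-in for the flagless bytes the parser passes along,
# here the UTF-8 encoding of "zespołów".
my $bytes = "zespo\xC5\x82\xC3\xB3w";

my @seen;
my $wrapped = decoding_handler(sub { push @seen, shift });
$wrapped->($bytes);

# Prints: 10 bytes in, 8 characters out
printf "%d bytes in, %d characters out\n", length($bytes), length($seen[0]);
```

With HTML::Parser itself this would presumably be plugged in as
text_h => [decoding_handler(sub { print TEST shift }), 'text']. It does
not help when the handler is installed inside HTML::TreeBuilder, which
is the case described above.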
Moshe

>
> -Dom
>
> --
> | Semantico: creators of major online resources |
> | URL: http://www.semantico.com/ |
> | Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232 |
> | Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |
>

--
I love deadlines. I like the whooshing sound they make as they fly by.
    -- Douglas Adams

Moshe Kaminsky <[EMAIL PROTECTED]>