Re: HTML::Parser modifies unicode characters

Moshe Kaminsky Mon, 13 Sep 2004 03:05:43 -0700

* Dominic Mitchell <[EMAIL PROTECTED]> [13/09/04 12:05]:
> Moshe Kaminsky wrote:
> 
> >* Dominic Mitchell <[EMAIL PROTECTED]> [12/09/04 01:53]:
> >
> >>Moshe Kaminsky wrote:
> >>
> >>>It appears that HTML::Parser modifies some unicode characters while 
> >>>parsing. The following program gives an example:
> >>>
> >>>#########
> >>>
> >>>#!/usr/bin/perl
> >>>use HTML::Parser;
> >>>use utf8;
> >>>open TEST, '>:utf8', 'word.txt';
> >>>my $p = new HTML::Parser text_h => [sub {print TEST shift}, 'text'];
> >>>$p->parse("zespołów\n");
> >>>close TEST;
> >>>
> >>>#########
> >>>
> >>>After running it, 'word.txt' contains "zespołów" with the funny l and 
> >>>the funny o following it transformed to something else. What am I doing 
> >>>wrong?
> >>>I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.
> >>
> >>It looks like HTML::Parser is losing the UTF-8 flag.  XS modules have a 
> >>nasty tendency to do this. :(
> >>
> >>Thankfully the workaround is fairly simple.  Add "use Encode" to the top 
> >>of the script, and change the callback slightly:
> >>
> >> sub { print TEST decode_utf8(shift) }
> >>
> >>seems to work ok here.
> >
> >
> >Thanks! That actually works. However, my real situation is that I'm 
> >using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and 
> >HTML::Parser. So to fix the problem, it appears that the only way is to 
> >modify HTML::TreeBuilder? I wonder if the maintainers of HTML::Parser 
> >are aware of this problem, and if so, why don't they do this 
> >automatically (or at least add an option to do it automatically) before 
> >giving the text to the handler?
> 
> Hmmm, it's a known problem:
> 
> http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS


Thanks. I must say, though, that the explanation there is quite vague. I 
don't see myself deducing your solution from this statement.

> 
> It doesn't look unsolvable, but it's slightly beyond my XS skills.  
> The key is indicating the character encoding of what you're parsing, 
> but that's sometimes difficult to determine in advance (think HTML 
> meta tags).

I know nothing about XS, unfortunately, but the way I imagine it is that 
at some point HTML::Parser calls the handler given by text_h, passing 
the text to it. So instead of passing the raw text, I suggest that it 
pass decode_utf8 applied to the text. Alternatively, it could call a 
fixed (ordinary Perl) sub 'foo', giving it the value of text_h and the 
text; foo would apply decode_utf8 to the text and then pass the result 
to text_h.
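
A rough sketch of that wrapper idea, in plain Perl. The name decoding_handler is made up for illustration and is not part of HTML::Parser; the sketch assumes the handler is being handed UTF-8 octets rather than an already decoded string:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper (not an HTML::Parser API): wrap a text handler so
# that it receives decoded characters instead of raw UTF-8 octets.
sub decoding_handler {
    my ($handler) = @_;
    return sub {
        my ($text, @rest) = @_;
        # Assumption: $text holds UTF-8 octets without the UTF-8 flag.
        return $handler->(decode_utf8($text), @rest);
    };
}

# Stand-in for what the parser would pass: the octets of the test word.
my @seen;
my $text_h = decoding_handler(sub { push @seen, $_[0] });
my $octets = encode_utf8("zespo\x{142}\x{f3}w");
$text_h->($octets);

# prints "10 octets in, 8 characters out"
print length($octets), " octets in, ", length($seen[0]), " characters out\n";
```

With the parser itself, the wrapped sub would go where the plain one goes now, e.g. text_h => [decoding_handler(\&my_handler), 'text']. That only makes sense as long as the parser hands back octets; applied to text that is already decoded it would do harm.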
> 
> As to how to fix it via HTML::FormatText, I'm not sure.  You'd need to 
> read through the code to find out what it's doing and fix at an 
> appropriate point.

I did that. The place to change is in fact in HTML::TreeBuilder. The 
problem is that I'm giving this code to other people, so now I need to 
tell them to make the same change (and they might not have the right 
permissions, might not know Perl, may have a different version of 
HTML::TreeBuilder ...)

> But perhaps there is another way.  Instead of writing out to a file, 
> can you write to an in-memory string?  If so, that string would hold 
> UTF-8 bytes without the UTF-8 flag set.  You could fix that by running 
> decode_utf8() over the string before writing it to a file.  Or simply 
> write the file out without any encoding layer, which would leave the 
> UTF-8 bytes untransformed.
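
For reference, a minimal sketch of the two options described in the quoted paragraph. It assumes $bytes stands in for the parser output (UTF-8 octets without the UTF-8 flag) and uses in-memory filehandles in place of real files so it can be run anywhere:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Assumption: $bytes plays the role of the text coming out of the parser,
# i.e. UTF-8 octets that Perl does not know are UTF-8.
my $bytes = encode_utf8("zespo\x{142}\x{f3}w\n");

# Option 1: decode to characters, then write through an encoding layer.
my $via_layer = '';
open my $enc_fh, '>:encoding(UTF-8)', \$via_layer or die "open: $!";
print {$enc_fh} decode_utf8($bytes);
close $enc_fh;

# Option 2: write the octets unchanged through a raw handle.
my $via_raw = '';
open my $raw_fh, '>:raw', \$via_raw or die "open: $!";
print {$raw_fh} $bytes;
close $raw_fh;

# Report whether both routes produced the same octets.
print $via_layer eq $via_raw ? "same\n" : "different\n";
```

Either route avoids printing undecoded octets through a ':utf8' handle, which is what encodes them a second time.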

In my real-life case I'm not writing to a file at all; I only did that 
in the example to make it easy to verify. The parsing is hidden inside 
HTML::FormatText, which gives me a text rendering of the whole HTML 
page. And if I apply decode_utf8 to that result, I get different 
gibberish (presumably because the result is already a Perl character 
string).

Thanks for the help.
Moshe

> 
> -Dom
> 
> -- 
> | Semantico: creators of major online resources          |
> |       URL: http://www.semantico.com/                   |
> |       Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232  |
> |   Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |
> 

-- 
I love deadlines. I like the whooshing sound they make as they fly by. 
                                        -- Douglas Adams
    
    Moshe Kaminsky <[EMAIL PROTECTED]>
