The following solution, proposed by Nick Ing-Simmons, worked for my case:
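($charset is the charset as extracted from the HTML code of the page, and $text is all the text from the page itself, as returned by the LWP agent.)

    use Encode;    # exports find_encoding by default

    binmode STDOUT, ":utf8";                   # print decoded text as UTF-8
    my $encoding = find_encoding($charset);    # look up the encoding object by name
    my $unicode  = $encoding->decode($text);   # raw bytes -> Perl character string
    print $unicode;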
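For anyone who wants to try the idea in isolation, here is a minimal self-contained sketch. The decode_page helper, the ISO-8859-1 fallback, and the sample Shift_JIS bytes are illustrative assumptions and not part of Nick's suggestion. The fallback is there because find_encoding returns undef when it does not recognise the charset name.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: decode raw page bytes using the charset named in the page.
# Assumption: fall back to ISO-8859-1 when the charset name is not recognised,
# since find_encoding returns undef in that case.
sub decode_page {
    my ($charset, $bytes) = @_;
    my $encoding = find_encoding($charset) || find_encoding('iso-8859-1');
    return $encoding->decode($bytes);
}

binmode STDOUT, ':utf8';

# Sample input made up for the demo: three kanji encoded as Shift_JIS bytes.
my $bytes   = "\x93\xfa\x96\x7b\x8c\xea";
my $unicode = decode_page('shift_jis', $bytes);

print $unicode, "\n";
```

In a real script, $charset and $bytes would come from the fetched page rather than being hard-coded, and a different fallback may be more appropriate depending on the pages being crawled.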
Thanks a lot to Nick and to all the others who responded to my plea for help.
Now for a much less pressing issue: does anybody know of something similar to the HTML::FormatText module that can take UTF-8 input and generate UTF-8 output? In other words, is there a module or command-line tool to which I could feed my Japanese HTML pages, or HTML documents in other non-Latin scripts, and get nicely formatted plain UTF-8 text as output?
(HTML::FormatText seems to break with UTF-8 and with the Japanese encodings.)
Thanks in advance.
Regards,
Marco
---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni