Re: [Israel.pm] Perl unicode question

Issac Goldstand Mon, 13 Feb 2012 03:13:27 -0800

Awesome.  Just got it, thanks :)

On 13/02/2012 12:55, Meir Guttman wrote:
> Dear Yitzchak,
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of 
> Issac Goldstand
> Sent: Monday, February 13, 2012 12:30 PM
> To: Perl in Israel
> Subject: [Israel.pm] Perl unicode question
>
> If there's one thing I can never seem to get straight, it's character 
> encodings...
>
> I'm trying to parse some data from the web which can come in different 
> encodings, and write unit tests which come from static files.
>
> One of the strings that I'm trying to test for is "Forex Trading Avec 100€"  
> The string is originally encoded (supposedly) in ISO-8859-1 based on the 
> header Content-Type: text/html; charset=ISO-8859-1 and presence of the 
> following META tag <meta http-equiv="Content-Type"
> content="text/html; charset=ISO-8859-1">
>
> (N.B. I'm a bit confused by that as IIRC, ISO-8859-1 doesn't contain the EUR 
> character...)
>
> When opening the source code in a text editor as either ISO-8859-1 or
> ISO-8859-15 (or even UTF-8), I can't see the character.  I *do* see the 
> character when viewing it as CP1255 which kinda worries me, as I get the 
> feeling I'm a lot farther from the source than I think when I see that...
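
One way to test that suspicion is to decode the suspect byte under each candidate charset. The sketch below assumes the euro sits in the file as the single byte 0x80 (an assumption; the real file is not shown in this thread). That is where Windows-1252 and CP1255 both place it, whereas ISO-8859-1 maps 0x80 to a C1 control character and ISO-8859-15 puts the euro at 0xA4 instead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample: the single byte 0x80, as it might appear in the file.
my $byte = "\x80";

# Decode the same byte under each candidate charset and record the code point.
my %codepoint;
for my $charset (qw(ISO-8859-1 cp1252 cp1255)) {
    my $copy = $byte;                      # decode a copy, keep the original intact
    my $char = decode($charset, $copy);
    $codepoint{$charset} = sprintf "U+%04X", ord $char;
}

printf "%-10s => %s\n", $_, $codepoint{$_} for sort keys %codepoint;
```

U+20AC is the euro sign, so a page that only displays the euro under a cp125x charset is most likely Windows-1252 mislabeled as ISO-8859-1. Browsers generally tolerate this by treating the ISO-8859-1 label as Windows-1252.
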
>
> My unit test for the above is as follows:
>
> use utf8; # String literals contain UTF-8 in this file
> binmode STDOUT, ":utf8";
> ...
> open($fh, "<:encoding(ISO-8859-1)", "t/html0004.html") || die "...: $!";
> $parser->parse_file($fh); # Subclassed HTML::Parser
> ...
> is($test->{top}, "Forex Trading Avec 100€", "Correct headline text");
>
> However, this test does not pass on the EURO, giving me the following
> result:
> Wide character in print at /usr/local/share/perl/5.12.4/Test/Builder.pm
> line 1759.
> #          got: 'Forex Trading Avec 100Â€'
> #     expected: 'Forex Trading Avec 100€'
>
> Both the warning and the mismatch bother me....  The warning, because I 
> assumed that opening STDOUT as a utf8 stream would deal with it.  And the 
> mismatch, because I can't figure why it's mismatching...
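
Both symptoms can be reproduced and, under two assumptions, made to go away. The first assumption is that the file's bytes are really Windows-1252 (0x80 for the euro), so the sketch uses an in-memory stand-in for t/html0004.html rather than the real file, and it leaves out the HTML::Parser subclass. The second is the behaviour described in the Test::More documentation: Test::Builder writes through its own duplicated handles, so a binmode on STDOUT does not reach them.

```perl
use strict;
use warnings;
use Test::More tests => 1;

# Test::Builder prints through its own handles, so set the encoding
# layer there rather than only on STDOUT.
my $builder = Test::More->builder;
binmode $builder->output,         ":encoding(UTF-8)";
binmode $builder->failure_output, ":encoding(UTF-8)";
binmode $builder->todo_output,    ":encoding(UTF-8)";

# Hypothetical stand-in for t/html0004.html: raw bytes, euro stored as 0x80.
my $raw = "Forex Trading Avec 100\x80";

# Same style of open() as in the test, but with cp1252 as the decoding layer.
open(my $fh, "<:encoding(cp1252)", \$raw) || die "Cannot open in-memory file: $!";
my $headline = <$fh>;
close $fh;

# \x{20AC} is the euro sign; with "use utf8" a literal euro would work as well.
is($headline, "Forex Trading Avec 100\x{20AC}", "Correct headline text");
```

With the ISO-8859-1 layer the byte 0x80 decodes to U+0080, which is what shows up in the "got" line above once it is printed as UTF-8.
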
>
> FWIW, when doing this on the web, I'd planned on converting to utf-8 by using 
> HTTP::Response's $res->decoded_content to deal with the encoding for me, but 
> that seems to be spewing characters that... don't look correct... too :/
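
For the live case, decoded_content trusts the declared charset unless told otherwise. A minimal sketch, assuming the server sends Windows-1252 bytes under an ISO-8859-1 label; the response object here is built by hand for illustration, not fetched from the real site. The charset option of decoded_content overrides the declared charset.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response: Windows-1252 bytes mislabeled as ISO-8859-1.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
    "Forex Trading Avec 100\x80",
);

# Decode once as declared, once with the charset overridden.
my $as_declared = $res->decoded_content;
my $overridden  = $res->decoded_content(charset => 'cp1252');

printf "declared:   U+%04X\n", ord substr($as_declared, -1);
printf "overridden: U+%04X\n", ord substr($overridden,  -1);
```

Whether overriding is appropriate depends on the site; it only helps when the declared charset is known to be wrong.
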
>
> Any ideas?
>
>
> =======================================
>
> There are a number of things that must be done together so that Unicode will 
> be supported. And don't put too much weight on the "charset..." clause in the 
> HTML.
>
> Since this list does not accept attachments, I'll send to your personal 
> address my upcoming presentation on "Unicode aspects in Perl", to be 
> presented in the Israel Perl Workshop 2012 (http://act.perl.org.il/ilpw2012/).
>
> Anybody else who is interested is welcome to ask, and I'll send it to her/him 
> too.
>
>
> _______________________________________________
> Perl mailing list
> [email protected]
> http://mail.perl.org.il/mailman/listinfo/perl


