Awesome. Just got it, thanks :)

On 13/02/2012 12:55, Meir Guttman wrote:
> Dear Yitzchak,
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Issac Goldstand
> Sent: Monday, February 13, 2012 12:30 PM
> To: Perl in Israel
> Subject: [Israel.pm] Perl unicode question
>
> If there's one thing I can never seem to get straight, it's character
> encodings...
>
> I'm trying to parse some data from the web which can come in different
> encodings, and write unit tests which come from static files.
>
> One of the strings that I'm trying to test for is
> "Forex Trading Avec 100€"
> The string is originally encoded (supposedly) in ISO-8859-1 based on the
> header Content-Type: text/html; charset=ISO-8859-1 and presence of the
> following META tag
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
>
> (N.B. I'm a bit confused by that as IIRC, ISO-8859-1 doesn't contain the
> EUR character...)
>
> When opening the source code in a text editor as either ISO-8859-1 or
> ISO-8859-15 (or even UTF-8), I can't see the character. I *do* see the
> character when viewing it as CP1255 which kinda worries me, as I get the
> feeling I'm a lot farther from the source than I think when I see that...
>
> My unit test for the above is as follows:
>
>     use utf8; # String literals contain UTF-8 in this file
>     binmode STDOUT, ":utf8";
>     ...
>     open($fh, "<:encoding(ISO-8859-1)", "t/html0004.html") || die "...: $!";
>     $parser->parse_file($fh); # Subclassed HTML::Parser
>     ...
>     is($test->{top}, "Forex Trading Avec 100€", "Correct headline text");
>
> However, this test does not pass on the EURO, giving me the following
> result:
>
>     Wide character in print at /usr/local/share/perl/5.12.4/Test/Builder.pm line 1759.
>     # got: 'Forex Trading Avec 100€'
>     # expected: 'Forex Trading Avec 100€'
>
> Both the warning and the mismatch bother me.... The warning, because I
> assumed that opening STDOUT as a utf8 stream would deal with it. And the
> mismatch, because I can't figure why it's mismatching...
>
> FWIW, when doing this on the web, I'd planned on converting to utf-8 by
> using HTTP::Response's $res->decoded_content to deal with the encoding for
> me, but that seems to be spewing characters that... don't look correct...
> too :/
>
> Any ideas?
>
>
> =======================================
>
> There are a number of things that must be done together so that Unicode
> will be supported. And don't put too much weight on the "charset..." clause
> in the HTML.
>
> Since this list does not accept attachments, I'll send to your personal
> address my upcoming presentation on "Unicode aspects in Perl", to be
> presented at the Israel Perl Workshop 2012
> (http://act.perl.org.il/ilpw2012/).
>
> Anybody else who is interested is welcome to ask and I'll send it to
> her/him too.
>
>
> _______________________________________________
> Perl mailing list
> [email protected]
> http://mail.perl.org.il/mailman/listinfo/perl
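
A note on the ISO-8859-1 puzzle raised in the question: ISO-8859-1 indeed has no euro sign, and seeing the character only when the file is viewed as CP1255 suggests that the byte in the file is 0x80, which is the euro in the Windows code pages (CP1252, CP1255). That would mean the page is probably CP1252 mislabelled as ISO-8859-1, but this is an inference from the symptoms described, not something confirmed in the thread. A minimal sketch with the core Encode module shows what each candidate encoding makes of the relevant bytes; codepoint_of is a hypothetical helper name introduced only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one byte under $encoding and
# report the resulting character as "U+XXXX".
sub codepoint_of {
    my ($encoding, $byte) = @_;
    my $char = decode($encoding, $byte);
    return sprintf 'U+%04X', ord $char;
}

my @cases = (
    [ 'iso-8859-1',  "\x80" ],   # a C1 control character, not the euro
    [ 'cp1252',      "\x80" ],   # the euro sign, U+20AC
    [ 'cp1255',      "\x80" ],   # also the euro sign, U+20AC
    [ 'iso-8859-1',  "\xA4" ],   # generic currency sign, U+00A4
    [ 'iso-8859-15', "\xA4" ],   # the euro sign, U+20AC
);

for my $case (@cases) {
    my ($encoding, $byte) = @$case;
    printf "%-12s 0x%02X -> %s\n",
        $encoding, ord $byte, codepoint_of($encoding, $byte);
}
```

If the file really does hold 0x80 at that position (worth checking with a hex dump first), opening it with "<:encoding(cp1252)" rather than "<:encoding(ISO-8859-1)" should give U+20AC, which is what the "use utf8" literal in the test contains.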
