Re: About HTML unicode

Masanori HATA Sun, 05 Dec 2004 02:17:21 -0800

John Delacour wrote:

At 12:31 am +0800 3/12/04, He Zhiqiang wrote:
Now I have encountered another problem: a few files contain not just one charset but two or more. For example, file1 contains Japanese and Chinese. If I use open() to load the data into memory, ord, length, etc. don't work correctly! Perhaps I am missing something to encode or decode the data?

code:

#!/usr/bin/perl -w
use utf8;
open(FD, "< file1");
while(<FD>) {
    chomp;
    print "length = ".length($_);
}
close FD;

----------
length() cannot count the non-ASCII characters correctly. :(
If the file is in UTF-8, then it may be in any number of _languages_ but it uses only one character set -- Unicode. So far as I know "use utf8" is now redundant and ineffectual in Perl. You will get the correct character count (6 characters rather than 18 bytes) by opening the file handle as utf-8 as below.

If I may add to JD's comment for Zhiqiang: "use utf8" only tells the Perl parser that the program source file is written in UTF-8. cf. <http://www.perldoc.com/perl5.8.4/lib/utf8.html>

That pragma has no other effect; in particular, it does not change how data read from a file is treated.
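
To illustrate what the pragma does affect, here is a minimal sketch. It assumes the script itself is saved as UTF-8; the variable names and the literal are only for illustration. The same three-character literal is measured once inside a block where "use utf8" is in effect and once outside it.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;    # lexically scoped: the literal below is parsed as UTF-8 characters
    $with_pragma = length("日本語");
}

{
    # No pragma here: the same literal is seen as a string of raw bytes.
    $without_pragma = length("日本語");
}

print "with use utf8:    $with_pragma\n";
print "without use utf8: $without_pragma\n";
```

Only literals in the source are affected; a string read from a file handle is counted in bytes either way unless it is decoded.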

Zhiqiang needs to tell Perl that the string is encoded in UTF-8. length() counts characters only when it is given the string in the so-called 'UTF8-flagged' form; otherwise it counts bytes.

JD has already suggested how to turn on the UTF8 flag via the ":utf8" I/O layer.
cf. <http://www.perldoc.com/perl5.8.4/lib/PerlIO.html>
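
JD's code is not quoted here, so the following is only a rough sketch of the I/O-layer approach, not his original. It writes a hypothetical one-line sample (the UTF-8 bytes of a three-character Japanese word) to a temporary file, then reads it back with and without the ":utf8" layer. The helper name first_line_length is made up for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: one line holding the UTF-8 bytes of
# a three-character word (9 bytes in total).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E\n";
close $out or die "Cannot close $path: $!";

# Read the first line through the given open mode and return its length.
sub first_line_length {
    my ($mode) = @_;
    open(my $in, $mode, $path) or die "Cannot open $path: $!";
    my $line = <$in>;
    close $in;
    chomp $line;
    return length $line;
}

my $byte_count = first_line_length('<');         # no layer: length counts bytes
my $char_count = first_line_length('<:utf8');    # :utf8 layer: length counts characters

print "without layer:    $byte_count\n";
print "with :utf8 layer: $char_count\n";
```

With the layer in place the data arrives already flagged, so no separate decoding step is needed before calling length() or ord().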

Another way to turn on the flag is to use the "utf8::decode()" function.
My sample code is below:

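#!/usr/local/bin/perl -w
use 5.008;
use strict;
use warnings;

# Read the file as raw bytes (no I/O layer), one line per element.
open(my $txt, '<', 'sample2.txt') or die "Cannot open sample2.txt: $!";
chomp(my @text = <$txt>);
close $txt;

print "utf8 flag disabled:\n";
foreach my $text (@text) {
    print length($text), "\n";    # counts bytes
}

print "utf8 flag enabled:\n";
foreach my $text (@text) {
    utf8::decode($text);          # decodes the UTF-8 bytes in place
    print length($text), "\n";    # counts characters
}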
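
For a check that does not depend on sample2.txt, here is a small in-memory sketch. The byte string and variable names are hypothetical. It shows how length() and ord() change once the string is decoded, and that utf8::decode() returns false when the bytes are not well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical input: the UTF-8 bytes of a three-character Japanese word.
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

my $before_len = length $bytes;    # byte count
my $before_ord = ord $bytes;       # value of the first byte

my $chars = $bytes;
my $ok = utf8::decode($chars);     # in place; true on success

my $after_len = length $chars;     # character count
my $after_ord = ord $chars;        # code point of the first character

# A truncated sequence is not well-formed UTF-8, so decoding fails.
my $bad    = "\xE6\x97";
my $bad_ok = utf8::decode($bad);

printf "before: length=%d first=0x%02X\n", $before_len, $before_ord;
printf "after:  length=%d first=U+%04X\n", $after_len, $after_ord;
print  "truncated input decoded: ", ($bad_ok ? "yes" : "no"), "\n";
```

Checking the return value is worthwhile when the input might not really be UTF-8, since a failed decode would otherwise go unnoticed.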

--
Masanori HATA
<[EMAIL PROTECTED]>
He's always with us!
