At 12:31 am +0800 3/12/04, He Zhiqiang wrote: Now I have encountered another problem: a few files contain not just one charset but two or more. For example, file1 contains Japanese and Chinese. If I use open() to load the data into memory, ord, length, etc. do not work correctly! Perhaps I am missing something to encode or decode the data?
code:
#!/usr/bin/perl -w
use utf8;
open(FD, "< file1");
while(<FD>) {
chomp;
print "length = ".length($_);
}
close FD;
----------
length() cannot count the non-ASCII characters correctly. :(
If the file is in UTF-8, then it may be in any number of _languages_ but it uses only one character set -- Unicode. So far as I know "use utf8" is now redundant and ineffectual in Perl. You will get the correct character count (6 characters rather than 18 bytes) by opening the file handle as utf-8 as below.
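The snippet JD refers to is not reproduced here, so what follows is only a minimal sketch of the idea, not JD's original code. It assumes a hypothetical sample line of six CJK characters (18 bytes in UTF-8), written to a temporary file so the example is self-contained. The helper name line_lengths is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a hypothetical sample line: six CJK characters, 18 bytes in UTF-8.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw');
print {$tmp_fh} "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E\xE4\xB8\xAD\xE6\x96\x87\xE5\xAD\x97\n";
close $tmp_fh or die "Cannot close $path: $!";

# Hypothetical helper: read the file through the given I/O layer and
# return length() of each chomped line.
sub line_lengths {
    my ($file, $layer) = @_;
    open(my $fh, "<$layer", $file) or die "Cannot open $file: $!";
    my @lengths;
    while (my $line = <$fh>) {
        chomp $line;
        push @lengths, length $line;
    }
    close $fh;
    return @lengths;
}

my ($bytes) = line_lengths($path, ':raw');    # no decoding: counts bytes
my ($chars) = line_lengths($path, ':utf8');   # decoded on read: counts characters

print "bytes = $bytes, characters = $chars\n";   # prints "bytes = 18, characters = 6"
```

The only change that matters is the layer in the open() mode. ':encoding(UTF-8)' can be used in place of ':utf8' if you also want malformed input to be checked.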
If I may add a comment to JD's reply for Zhiqiang: "use utf8" just tells the Perl parser that the program source file is written in UTF-8.
cf. <http://www.perldoc.com/perl5.8.4/lib/utf8.html>
No other effect is expected from that pragma.
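To illustrate that point, here is a small sketch (my own, and it assumes the script file itself is saved as UTF-8). The pragma is lexically scoped, so the same string literal can be measured with and without it. The variable names are just for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same literal, parsed with and without the pragma.
# Assumption: this source file is saved in UTF-8.
my $with    = do { use utf8; length("日本語") };   # literal parsed as 3 characters
my $without = do { no utf8;  length("日本語") };   # literal parsed as 9 bytes

print "with use utf8:    $with\n";
print "without use utf8: $without\n";
```

Note that nothing here touches a file handle: the pragma affects how the literal in the source is read, not data that comes in through open().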
Zhiqiang has to tell Perl that the string is encoded in UTF-8: length() should be given the string in its so-called 'UTF8-flagged' form.
JD has already suggested how to enable the UTF8 flag via the ":utf8" I/O layer. cf. <http://www.perldoc.com/perl5.8.4/lib/PerlIO.html>
Another way to enable the flag is to use the utf8::decode() function. My sample code is below:
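#!/usr/local/bin/perl -w
use 5.008;
use strict;
use warnings;

# Read the file without any I/O layer, so the lines are plain byte strings.
open (TXT, '<sample2.txt') or die "Cannot open sample2.txt: $!";
chomp(my @text = <TXT>);
close TXT;

# length() counts bytes here.
print "utf8 flag disabled:\n";
foreach my $text (@text) {
    print length($text), "\n";
}

# utf8::decode() converts each line in place; length() now counts characters.
print "utf8 flag enabled:\n";
foreach my $text (@text) {
    utf8::decode($text);
    print length($text), "\n";
}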
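One more detail that may be useful: utf8::decode() works in place and returns a true or false value, so a mixed-encoding file can be checked line by line. This is only a sketch with in-memory byte strings of my own choosing (the byte values are an assumption for illustration, not taken from Zhiqiang's files).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Three CJK characters as raw UTF-8 bytes (9 bytes).
my $valid  = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
my $before = length $valid;            # byte count before decoding
my $ok     = utf8::decode($valid);     # decodes in place, true on success
my $after  = length $valid;            # character count after decoding

# Two bytes that are not well-formed UTF-8.
my $invalid = "\xFF\xFE";
my $bad_ok  = utf8::decode($invalid);  # false; the string is left as bytes

print "valid:   before = $before, after = $after, decoded = ", ($ok ? "yes" : "no"), "\n";
print "invalid: length = ", length($invalid), ", decoded = ", ($bad_ok ? "yes" : "no"), "\n";
```

If a line fails to decode, it is probably not UTF-8 at all and needs a different decoder.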
-- Masanori HATA <[EMAIL PROTECTED]> He's always with us!