Hello.
On Thu, 21 Mar 2002 10:07:09 +0100
[EMAIL PROTECTED] (Andreas J. Koenig) wrote:
> Larry's recent favorite bug posting has yielded fruit, very nice
> indeed, thanks. But now I read the recently edited paragraph from
> perlunicode.pod:
>
> If the keys of a hash are "mixed", that is, some keys are Unicode,
> while some keys are "byte", the keys may behave differently in regular
> expressions since the definition of character classes like C</\w/>
> is different for byte strings and character strings. This problem can
> sometimes be helped by using an appropriate locale (see L<perllocale>).
> Another way is to force all the strings to be character encoded by
> using utf8::upgrade() (see L<utf8>).
>
> My headache starts with the last sentence. The whole truth would be
>
> Another way is to force all the strings to be character encoded by
> using utf8::upgrade() WHENEVER YOU ARE GOING TO USE A REGULAR
> EXPRESSION WITH CHARACTER SEMANTICS.
>
> Without the locale thingy, it will not suffice to make sure, all
> strings are upgraded to Unicode, you will also need to make sure, they
> are *still* upgraded whenever you use a regular expression with a
> character class.
>
> Demonstration:
>
> % /usr/local/perl-5.7.3@15380/bin/perl -e '
> my $u = "f\x{df}";
> require utf8;
> utf8::upgrade($u);
> my %u = ( $u => $u ); # might happen in a module too
> for (keys %u){
> my $m1 = /^\w*$/;
> my $m2 = $u{$_}=~/^\w*$/;
> print $m1==$m2 ? "ok\n" : "not ok\n";
> }
> '
> not ok
Hmm, but the following test, which does not go through keys(), prints "ok".
#!perl
my $u = "f\x{df}";
utf8::upgrade($u);
my %u = ( $u => $u ); # might happen in a module too
my $m1 = $u =~ /^\w*$/;
my $m2 = $u{$u} =~ /^\w*$/;
print $m1==$m2 ? "ok\n" : "not ok\n";
__END__
>
> See, upgrading once is not enough, you need to upgrade everywhere you
> use a regular expression with character semantics:
>
> % /usr/local/perl-5.7.3@15380/bin/perl -e '
> my $u = "f\x{df}";
> require utf8;
> utf8::upgrade($u);
> my %u = ( $u => $u ); # might happen in a module too
> for (keys %u){
> utf8::upgrade($_); ####
> utf8::upgrade($u{$_}); #### 2 lines added
> my $m1 = /^\w*$/;
> my $m2 = $u{$_}=~/^\w*$/;
> print $m1==$m2 ? "ok\n" : "not ok\n";
> }
> '
> ok
Hash keys seem to be downgraded when they are stored...
In that case only one added line is necessary, isn't it?
#!perl
my $u = "f\x{df}";
utf8::upgrade($u);
my %u = ( $u => $u );
for (keys %u){
    utf8::upgrade($_);
    my $m1 = /^\w*$/;
    my $m2 = $u{$_}=~/^\w*$/;
    print $m1==$m2 ? "ok\n" : "not ok\n";
}
__END__
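To check whether keys() really hands back downgraded keys, one could look at the internal flag directly. This is only a minimal sketch: it assumes a perl that provides utf8::is_utf8() (not every 5.7.x snapshot does), and key_flags() is a helper name made up for this mail. The result may well differ between perl versions, so I give no expected output.

```perl
use strict;
use warnings;

# Hypothetical helper: for every key of a hash, report whether the key
# string comes back from keys() with the internal UTF8 flag set.
sub key_flags {
    my ($hash) = @_;
    return map { [ $_, utf8::is_utf8($_) ? 1 : 0 ] } keys %$hash;
}

my $u = "f\x{df}";
utf8::upgrade($u);          # the value is now character (UTF-8) encoded
my %u = ( $u => $u );

for my $pair ( key_flags(\%u) ) {
    my ($key, $flag) = @$pair;
    printf "key utf8=%d, value utf8=%d\n",
        $flag, utf8::is_utf8( $u{$key} ) ? 1 : 0;
}
```

If the key line shows 0 while the value shows 1, the key was downgraded on the way in or out of the hash, which would explain why only the key needs the extra utf8::upgrade().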
Nevertheless, we shouldn't distinguish the Unicode-ness of hash keys;
otherwise we'd be even more upset... :-)
#!perl
use charnames qw(:full);
my $alpha = "\N{GREEK SMALL LETTER ALPHA}";
# "\x{3B1}" (decimal 945) = "\xCE\xB1" UTF8
my $latin =
"\N{LATIN CAPITAL LETTER I WITH CIRCUMFLEX}\N{PLUS-MINUS SIGN}";
# "\xCE\xB1" Bytes
my %hash;
$hash{$alpha} = "foo";
$hash{$latin} = "bar";
print $hash{$alpha} eq $hash{$latin} ? "not ok" : "ok";
# Perl 5.6.1 says "not ok",
# while Perl 5.7.3 says "ok".
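The reason these two keys can collide at all is that the UTF-8 encoding of the alpha has the same bytes as the two Latin-1 characters. Here is a small sketch of that, using plain \x escapes instead of charnames (my substitution, so that both strings are unambiguous) and utf8::encode() to obtain the byte form:

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";      # GREEK SMALL LETTER ALPHA, one character
my $latin = "\xCE\xB1";     # I WITH CIRCUMFLEX + PLUS-MINUS SIGN, two characters

# Encode a copy of $alpha into its UTF-8 byte sequence.
my $alpha_bytes = $alpha;
utf8::encode($alpha_bytes);

print "as characters: ", ( $alpha eq $latin       ? "same" : "different" ), "\n";
print "as raw bytes:  ", ( $alpha_bytes eq $latin ? "same" : "different" ), "\n";
```

As characters the strings differ (lengths 1 and 2), but byte for byte the encoded alpha equals the Latin-1 string, so a hash that compares only the stored bytes cannot tell the two keys apart.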
> I'm sure everybody will agree that this is not only unperlish, it is
> unbearable and falls back behind 5.005_50. For that reason I would
> suggest to drop the mention of utf8::upgrade here, maybe thusly:
\p{Word} always seems to behave like a Unicode-oriented \w.
Could it be a solution?
#!perl
my $u = "f\x{df}";
my %u = ( $u => $u );
for (keys %u){
    my $m1 = /^\p{Word}*$/;
    my $m2 = $u{$_}=~/^\p{Word}*$/;
    print $m1 && $m2 ? "ok\n" : "not ok\n";
}
# Naturally, we never want both $m1 and $m2 to be false.
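To compare the two classes side by side, independent of the hash, something like the following could be used. It is a sketch only: word_match() is a made-up helper, and I assume no locale and no other pragma that changes regex semantics is in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: match one string against both character classes.
sub word_match {
    my ($s) = @_;
    return {
        w      => ( $s =~ /^\w*$/       ? 1 : 0 ),
        p_word => ( $s =~ /^\p{Word}*$/ ? 1 : 0 ),
    };
}

my $bytes = "f\x{df}";      # byte string
my $chars = $bytes;
utf8::upgrade($chars);      # same characters, character encoded

for my $case ( [ bytes => $bytes ], [ chars => $chars ] ) {
    my ($name, $s) = @$case;
    my $r = word_match($s);
    printf "%-5s \\w=%d \\p{Word}=%d\n", $name, $r->{w}, $r->{p_word};
}
```

If \p{Word} really is encoding-independent, its column should be 1 on both lines, while the \w column may differ between the byte string and the upgraded one.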
> --- pod/perlunicode.pod~ Thu Mar 21 08:15:43 2002
> +++ pod/perlunicode.pod Thu Mar 21 09:59:33 2002
> @@ -966,9 +966,7 @@
> while some keys are "byte", the keys may behave differently in regular
> expressions since the definition of character classes like C</\w/>
> is different for byte strings and character strings. This problem can
> -sometimes be helped by using an appropriate locale (see L<perllocale>).
> -Another way is to force all the strings to be character encoded by
> -using utf8::upgrade() (see L<utf8>).
> +be helped by using an UTF-8 locale (see L<perllocale>).
>
> Some functions are slower when working on UTF-8 encoded strings than
> on byte encoded strings. All functions that need to hop over
>
>
>
>
> Another possibility is, of course, that the demonstrated behaviour is
> a vanilla bug and gets fixed before 5.8.0. :-/
>
>
>
> --
> andreas
Sincerely
SADAHIRO Tomoyuki