Re: perlunicode.pod mention of utf8::upgrade questionable

SADAHIRO Tomoyuki Thu, 21 Mar 2002 04:55:45 -0800


Hello.


On Thu, 21 Mar 2002 10:07:09 +0100
[EMAIL PROTECTED] (Andreas J. Koenig) wrote:

> Larry's recent favorite bug posting has yielded fruit, very nice
> indeed, thanks. But now I read the recently edited paragraph from
> perlunicode.pod:
> 
>     If the keys of a hash are "mixed", that is, some keys are Unicode,
>     while some keys are "byte", the keys may behave differently in regular
>     expressions since the definition of character classes like C</\w/>
>     is different for byte strings and character strings.  This problem can
>     sometimes be helped by using an appropriate locale (see L<perllocale>).
>     Another way is to force all the strings to be character encoded by
>     using utf8::upgrade() (see L<utf8>).
> 
> My headache starts with the last sentence. The whole truth would be
> 
>     Another way is to force all the strings to be character encoded by
>     using utf8::upgrade() WHENEVER YOU ARE GOING TO USE A REGULAR
>     EXPRESSION WITH CHARACTER SEMANTICS.
> 
> Without the locale thingy, it will not suffice to make sure, all
> strings are upgraded to Unicode, you will also need to make sure, they
> are *still* upgraded whenever you use a regular expression with a
> character class.
> 
> Demonstration:
> 
> % /usr/local/perl-5.7.3@15380/bin/perl -e '
>   my $u = "f\x{df}";
>   require utf8;
>   utf8::upgrade($u);
>   my %u = ( $u => $u );            # might happen in a module too
>   for (keys %u){
>     my $m1 = /^\w*$/;
>     my $m2 = $u{$_}=~/^\w*$/;
>     print $m1==$m2 ? "ok\n" : "not ok\n";                
>   }
> '
> not ok

Hmm, but a test like this one says ok.

#!perl
  my $u = "f\x{df}";
  utf8::upgrade($u);
  my %u = ( $u => $u );    # might happen in a module too
  
  my $m1 = $u =~ /^\w*$/;
  my $m2 = $u{$u} =~ /^\w*$/;
  print $m1==$m2 ? "ok\n" : "not ok\n";

__END__


> 
> See, upgrading once is not enough, you need to upgrade everywhere you
> use a regular expression with character semantics:
> 
> % /usr/local/perl-5.7.3@15380/bin/perl -e '
>   my $u = "f\x{df}";
>   require utf8;
>   utf8::upgrade($u);
>   my %u = ( $u => $u );            # might happen in a module too
>   for (keys %u){
>     utf8::upgrade($_);             ####
>     utf8::upgrade($u{$_});         ####  2 lines added
>     my $m1 = /^\w*$/;            
>     my $m2 = $u{$_}=~/^\w*$/;            
>     print $m1==$m2 ? "ok\n" : "not ok\n";
>   }
> '
> ok

Hash keys seem to be stored in downgraded form...
In that case, only one added line is necessary, isn't it?

#!perl
   my $u = "f\x{df}";
   utf8::upgrade($u);
   my %u = ( $u => $u );
   for (keys %u){
     utf8::upgrade($_);
     my $m1 = /^\w*$/;
     my $m2 = $u{$_}=~/^\w*$/;
     print $m1==$m2 ? "ok\n" : "not ok\n";
   }

__END__
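
One way to look at this directly might be to print the internal flag of
each key and value. This is only a sketch: flag() is a helper name made
up here, and utf8::is_utf8() is assumed to be available in the perl at
hand. What gets reported for the key depends on the perl version, so no
expected output is given.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: report whether a string carries the internal UTF8 flag.
sub flag { utf8::is_utf8($_[0]) ? "utf8" : "bytes" }

my $u = "f\x{df}";
utf8::upgrade($u);
my %u = ( $u => $u );

# Collect one line per key, showing the flag of the key as returned by
# keys() and the flag of the value stored under it.
my @report;
for my $k (keys %u) {
    push @report, sprintf "key=%s value=%s", flag($k), flag($u{$k});
}
print "$_\n" for @report;
```

If the key line says "bytes" while the value says "utf8", the key came
back downgraded, which would match the behaviour seen above.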

Nevertheless, we shouldn't distinguish the Unicode-ness of hash keys;
otherwise we'd be even more upset... :-)

#!perl
use charnames qw(:full);

my $alpha = "\N{GREEK SMALL LETTER ALPHA}";
   # "\x{3B1}" (U+03B1, decimal 945) = "\xCE\xB1" in UTF-8

my $latin =
  "\N{LATIN CAPITAL LETTER I WITH CIRCUMFLEX}\N{PLUS-MINUS SIGN}";
   # "\xCE\xB1" Bytes

my %hash;
$hash{$alpha} = "foo";
$hash{$latin} = "bar";

print $hash{$alpha} eq $hash{$latin} ? "not ok\n" : "ok\n";

# Perl 5.6.1 says "not ok",
# while Perl 5.7.3 says "ok".
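
The two keys above differ as characters but are identical as bytes once
$alpha is UTF-8 encoded, which is presumably why 5.6.1 confuses them.
A small sketch of that comparison, using utf8::encode() on a copy (the
names $alpha_bytes, $same_bytes, $same_chars and $nkeys are made up for
illustration):

```perl
#!perl
use strict;
use warnings;

my $alpha = "\x{3B1}";       # GREEK SMALL LETTER ALPHA, one character
my $latin = "\xCE\xB1";      # I WITH CIRCUMFLEX + PLUS-MINUS SIGN, two characters

# Encode a copy of $alpha into its UTF-8 byte sequence.
my $alpha_bytes = $alpha;
utf8::encode($alpha_bytes);

my $same_bytes = ($alpha_bytes eq $latin) ? 1 : 0;   # compared as encoded bytes
my $same_chars = ($alpha eq $latin)       ? 1 : 0;   # compared as characters

my %hash = ( $alpha => "foo", $latin => "bar" );
my $nkeys = keys %hash;      # 2 where the two keys are kept apart

print "same bytes: $same_bytes, same chars: $same_chars, keys: $nkeys\n";
```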

> I'm sure everybody will agree that this is not only unperlish, it is
> unbearable and falls back behind 5.005_50. For that reason I would
> suggest to drop the mention of utf8::upgrade here, maybe thusly:

\p{Word} seems to always behave as a Unicode-oriented \w.
Could it be a solution?

#!perl
  my $u = "f\x{df}";
  my %u = ( $u => $u );
  for (keys %u){
     my $m1 = /^\p{Word}*$/;
     my $m2 = $u{$_}=~/^\p{Word}*$/;
     print $m1 && $m2 ? "ok\n" : "not ok\n";
  }
  # Naturally, we never want $m1 and $m2 to both be false,
  # hence the test with && rather than ==.
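
To compare the two classes side by side, a sketch like the following
could be used. It assumes that no locale and no
"use feature 'unicode_strings'" is in effect; %result and its labels
are illustrative only.

```perl
#!perl
use strict;
use warnings;

my $bytes = "f\x{df}";       # byte string, UTF8 flag off
my $chars = $bytes;
utf8::upgrade($chars);       # same characters, UTF8 flag on

# Record whether each string matches each character class.
my %result = (
    'bytes \w'       => ($bytes =~ /^\w*$/       ? 1 : 0),
    'chars \w'       => ($chars =~ /^\w*$/       ? 1 : 0),
    'bytes \p{Word}' => ($bytes =~ /^\p{Word}*$/ ? 1 : 0),
    'chars \p{Word}' => ($chars =~ /^\p{Word}*$/ ? 1 : 0),
);

printf "%-15s %s\n", $_, $result{$_} ? "match" : "no match"
    for sort keys %result;
```

Under those assumptions only the "bytes \w" case should fail to match,
since \p{Word} asks for Unicode rules regardless of how the string is
stored.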


> --- pod/perlunicode.pod~      Thu Mar 21 08:15:43 2002
> +++ pod/perlunicode.pod       Thu Mar 21 09:59:33 2002
> @@ -966,9 +966,7 @@
>  while some keys are "byte", the keys may behave differently in regular
>  expressions since the definition of character classes like C</\w/>
>  is different for byte strings and character strings.  This problem can
> -sometimes be helped by using an appropriate locale (see L<perllocale>).
> -Another way is to force all the strings to be character encoded by
> -using utf8::upgrade() (see L<utf8>).
> +be helped by using an UTF-8 locale (see L<perllocale>).
>  
>  Some functions are slower when working on UTF-8 encoded strings than
>  on byte encoded strings. All functions that need to hop over
> 
> 
> 
> 
> Another possibility is, of course, that the demonstrated behaviour is
> a vanilla bug and gets fixed before 5.8.0.  :-/
> 
> 
> 
> -- 
> andreas

Sincerely
SADAHIRO Tomoyuki
