Regex Unicode Bug?

angie ahl Wed, 01 Dec 2004 08:00:28 -0800

Hi list.

I wonder if anyone would mind confirming this for me:


I've just spotted a strange behaviour with Unicode and regex in Perl
5.8.1, as demonstrated in the following script.

$junktext is a string of Unicode characters containing 3 smileys
(U+263A). 1 smiley is at the end of the string.

When doing a regex replace (s///) with the case-insensitive switch (/i)
OFF, all 3 smileys are replaced.
When doing the replace with the case-insensitive switch ON, only the
first 2 smileys are replaced.

Can anyone confirm this on their setup? Does this still occur in the
latest Perl, i.e. 5.8.5 or even 5.8.6?

#!/usr/bin/perl
#
# This is a test script to demonstrate a problem found with unicode in
# perl's regex.
# The version of perl tested is 5.8.1, NOT the latest version
#
# $junktext is a string of unicode characters containing 3 smileys. 1
# smiley is at the end of the string.
#
# when doing a regex replace (s///) with the case insensitive switch
# OFF all 3 smileys are replaced
# when doing the replace with the case insensitive switch ON only the
# first 2 smileys are replaced
#
# author: [EMAIL PROTECTED] 2004/12/01 15:30:00

use strict;
use warnings;

use utf8;
use CGI (':standard');
use Encode qw/is_utf8 decode/;

binmode(STDOUT, ":utf8");

BEGIN {
        print header(-type => "text/html",  -charset => "utf-8");
        print start_html(-encoding => 'utf-8',
                         -title    => "Some sample characters");
        print "\n\n";
}

my $junktext = 
"\x{0142}\x{e7}\x{263a}\x{0104}\x{263a}\x{0104}re\x{e7}enu\x{263a}";

# the /g line (commented out) replaces all 3; the /gi line (active) only 2.
# swap the comments to compare... why does /i break it?
#       my $matches = ($junktext =~ s/(\x{263a})/* was smily */g);
        my $matches = ($junktext =~ s/(\x{263a})/* was smily */gi);

print $matches .  " = " . $junktext;

END {
        print "\n\n", end_html;
}
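
For anyone who wants to try it without CGI, here is a stripped-down
sketch of the same test. The names $original, $n_plain and $n_nocase are
only for this illustration and are not part of the script above. It runs
both substitutions on copies of the same string and prints the two
counts; since the string holds 3 smileys, both should come out as 3 on a
perl that does not show the problem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

binmode(STDOUT, ":utf8");

# Same test string as the full script: 3 x U+263A, the last one at the end.
my $original =
    "\x{0142}\x{e7}\x{263a}\x{0104}\x{263a}\x{0104}re\x{e7}enu\x{263a}";

# Case-sensitive replace on a copy; s///g returns the number of replacements.
my $plain   = $original;
my $n_plain = ($plain =~ s/\x{263a}/* was smily */g);

# Same replace with /i added, on a fresh copy.
my $nocase   = $original;
my $n_nocase = ($nocase =~ s/\x{263a}/* was smily */gi);

print "without /i: $n_plain = $plain\n";
print "with /i:    $n_nocase = $nocase\n";
print $n_plain == $n_nocase ? "counts agree\n" : "counts DIFFER\n";
```

If the last line says the counts differ, that setup shows the same
behaviour I am seeing on 5.8.1.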
