To follow up on Tom's good examples, which I believe were run on Perl
5.6, I decided to try with Perl 5.8.0, and I found that this version of
Perl _is_ indeed a lot better.

In Perl 5.8, the idea is that the internal representation (single-byte
or utf8) should not be visible to the programmer. So Perl may choose
either internal representation for a string with characters in the range
128-255, but applications will not experience a difference.
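
To make that concrete, here is a minimal sketch of my own (assuming Perl
5.8.1 or later, where utf8::upgrade and utf8::is_utf8 are available
without loading any module). It stores the same character both ways and
compares what the application sees:

```perl
use strict;
use warnings;

my $narrow = "\xa3";      # one character, stored as a single byte
my $wide   = "\xa3";
utf8::upgrade($wide);     # same character, now stored as utf8 internally

# Only the internal flag differs; string operations see the same value.
print "flags:  ", (utf8::is_utf8($narrow) ? 1 : 0), " vs ",
                  (utf8::is_utf8($wide)   ? 1 : 0), "\n";
print "equal:  ", ($narrow eq $wide ? "yes" : "no"), "\n";
print "length: ", length($narrow), " vs ", length($wide), "\n";
print "ord:    ", ord($narrow), " vs ", ord($wide), "\n";
```

The variable names are just for illustration; the Devel::Peek dumps
further down show the same thing in more detail.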

A bit of code to test this by showing internal representation, length,
character codes, and output for different strings:

use Devel::Peek;
sub mydump {
  my ($x) = @_;
  Dump $x;
  print "LENGTH=", length($x), ".\n";
  print "CHARS: ", join(",", map(ord(substr($x,$_,1)), 0..length($x)-1)), "\n";
  print "VALUE: '", $x, "'.\n";
}

Here is what Perl 5.8 is doing:

my $string_0 = "\xa3";
mydump($string_0);

SV = PV(0x811f694) at 0x8128d54
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x81256c8 "\243"\0
  CUR = 1
  LEN = 2
LENGTH=1.
CHARS: 163
VALUE: '£'.

No surprises.

my $string_1 = pack("U0a*","\302\243"); # Force utf8 internal representation.
mydump($string_1);

SV = PV(0x811f694) at 0x8128d54
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x81256c8 "\302\243"\0 [UTF8 "\x{a3}"]
  CUR = 2
  LEN = 3
LENGTH=1.
CHARS: 163
VALUE: '£'.

This is the _same_ string as $string_0, but now stored in a different
(utf8) internal representation. But note that the strings work
identically in the application; the length, the characters contained,
and the output are identical (IO in Perl 5.8 by default uses single-byte
encoding).
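
One way to check the output claim without relying on what the terminal
shows is to print both strings to an in-memory filehandle and compare
the bytes that arrive. This is only a sketch; written_bytes is a helper
name I made up, and it assumes Perl 5.8.1 or later (in-memory
filehandles and utf8::upgrade):

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw bytes that print() writes for a
# string on a handle with no encoding layer (the Perl 5.8 default).
sub written_bytes {
    my ($s) = @_;
    my $buf = '';
    open(my $fh, '>', \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} $s;
    close($fh);
    return $buf;
}

my $narrow = "\xa3";
my $wide   = "\xa3";
utf8::upgrade($wide);     # force the utf8 internal representation

print "bytes written: ", length(written_bytes($narrow)), " vs ",
                         length(written_bytes($wide)), "\n";
print "identical: ",
      (written_bytes($narrow) eq written_bytes($wide) ? "yes" : "no"), "\n";
```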

my $string_2 = $string_0 . $string_1;
mydump($string_2);

SV = PV(0x811f694) at 0x8128d54
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x81256c8 "\302\243\302\243"\0 [UTF8 "\x{a3}\x{a3}"]
  CUR = 4
  LEN = 5
LENGTH=2.
CHARS: 163,163
VALUE: '££'.

When joining $string_0 and $string_1, Perl decides to use utf8 internal
encoding, but again the application sees no difference.
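
The same check can be done without Devel::Peek. A small sketch, again
assuming Perl 5.8.1 or later for utf8::is_utf8, with variable names of
my own choosing:

```perl
use strict;
use warnings;

my $narrow = "\xa3";              # single-byte representation
my $wide   = "\xa3";
utf8::upgrade($wide);             # utf8 representation

my $joined = $narrow . $wide;     # mixing the two upgrades the result

print "utf8 flag on result: ", (utf8::is_utf8($joined) ? 1 : 0), "\n";
print "length: ", length($joined), "\n";
print "equal to \\xa3\\xa3: ", ($joined eq "\xa3\xa3" ? "yes" : "no"), "\n";
```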

my $string_3 = "\x{263a}";
mydump($string_3);

SV = PV(0x811f694) at 0x8128d54
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x81256c8 "\342\230\272"\0 [UTF8 "\x{263a}"]
  CUR = 3
  LEN = 5
LENGTH=1.
CHARS: 9786
Wide character in print at ./test.pl line 10.
VALUE: '☺'.

Here we have a char > 255. Length and characters work ok, but for output
Perl detects that single-byte encoding cannot output the string
correctly. It switches to UTF-8 encoding with a warning.
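
If you want UTF-8 output without the warning, you can say so explicitly
instead of letting Perl fall back to it. A minimal sketch, assuming the
Encode module and the :encoding layer that ship with Perl 5.8; this is
my own illustration and not part of the test script above:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263a}";

# Option 1: encode by hand and look at the resulting bytes.
my $octets = encode("UTF-8", $smiley);
print length($smiley), " character, ", length($octets), " bytes\n";

# Option 2: declare the encoding of the handle once; after this,
# characters above 255 are printed without a "Wide character" warning.
binmode(STDOUT, ":encoding(UTF-8)");
print "VALUE: '", $smiley, "'.\n";
```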

my $string_4 = $string_0 . $string_3;
mydump($string_4);

SV = PV(0x811f694) at 0x8128d54
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x81256c8 "\302\243\342\230\272"\0 [UTF8 "\x{a3}\x{263a}"]
  CUR = 5
  LEN = 6
LENGTH=2.
CHARS: 163,9786
Wide character in print at ./test.pl line 10.
VALUE: '£☺'.

Again, the >255 char forces a switch to UTF-8 encoding, and a warning.

As far as I can tell, Perl 5.8 does the right thing, and people
generally should not have too many problems under it. But see what Perl
5.6.1 is doing:

my $string_0 = "\xa3";
mydump($string_0);

SV = PV(0x80f6408) at 0x8100300
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x81018e0 "\243"\0
  CUR = 1
  LEN = 2
LENGTH=1.
CHARS: 163
VALUE: '£'.

my $string_1 = pack("U0a*","\302\243"); # Force UTF8 internal representation.
mydump($string_1);

SV = PV(0x80f6408) at 0x8100300
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x81018e0 "\302\243"\0
  CUR = 2
  LEN = 3
LENGTH=1.
CHARS: 163
VALUE: '£'.

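Perl 5.6.1 gives _different_ output for the _same_ string, depending on
the internal representation (as far as I can tell, the utf8-flagged copy
is written out as its two internal bytes, \302\243, rather than as the
single byte \243). This is bad, and is probably the reason why the 5.6.1
docs warn that utf8 support is not stable in that version.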

I guess the issue is simply that XML::Parser (and derivatives) should
not have utf8 features enabled by default in Perl 5.6.1, since utf8
support isn't stable until 5.8.

Likewise, I think DBI and drivers should not by default enable utf8
support in Perl < 5.8 (but please give an option to enable it for those
of us who need to use UTF-8 in Perl 5.6.1, and know what we are doing).
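
What I have in mind is roughly the following. This is only a sketch of
the policy; want_utf8 and the enable_utf8 option are names I made up
for illustration and are not existing DBI or XML::Parser attributes:

```perl
use strict;

# Hypothetical policy helper: an explicit enable_utf8 option always
# wins; without one, utf8 handling is on only for Perl 5.8 and later.
sub want_utf8 {
    my (%opt) = @_;
    return $opt{enable_utf8} ? 1 : 0 if exists $opt{enable_utf8};
    return $] >= 5.008 ? 1 : 0;
}

print "default on this Perl: ", want_utf8(), "\n";
print "explicitly enabled:   ", want_utf8(enable_utf8 => 1), "\n";
print "explicitly disabled:  ", want_utf8(enable_utf8 => 0), "\n";
```

That way nothing changes silently for people on 5.6.1, and those who
know what they are doing can still turn it on.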

 - Kristian.
