To follow up on Tom's good examples, which I believe were run on Perl 5.6, I decided to try with Perl 5.8.0, and I found that this version of Perl _is_ indeed a lot better.

In Perl 5.8, the idea is that the internal representation (single-byte or utf8) should not be visible to the programmer. So Perl may choose either internal representation for a string with characters in the range 128-255, but applications will not experience a difference.

A bit of code to test this by showing the internal representation, length, character codes, and output for different strings:

    use Devel::Peek;

    sub mydump {
        my ($x) = @_;
        Dump $x;
        print "LENGTH=", length($x), ".\n";
        print "CHARS: ", join(",", map(ord(substr($x,$_,1)), 0..length($x)-1)), "\n";
        print "VALUE: '", $x, "'.\n";
    }

Here is what Perl 5.8 is doing:

    my $string_0 = "\xa3";
    mydump($string_0);

    SV = PV(0x811f694) at 0x8128d54
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x81256c8 "\243"\0
      CUR = 1
      LEN = 2
    LENGTH=1.
    CHARS: 163
    VALUE: '£'.

No surprises.

    my $string_1 = pack("U0a*","\302\243"); # Force utf8 internal representation.
    mydump($string_1);

    SV = PV(0x811f694) at 0x8128d54
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x81256c8 "\302\243"\0 [UTF8 "\x{a3}"]
      CUR = 2
      LEN = 3
    LENGTH=1.
    CHARS: 163
    VALUE: '£'.

This is the _same_ string as $string_0, but now stored in a different (utf8) internal representation. Note that the two strings behave identically in the application: the length, the characters they contain, and the output are all identical (IO in Perl 5.8 by default uses single-byte encoding).

    my $string_2 = $string_0 . $string_1;
    mydump($string_2);

    SV = PV(0x811f694) at 0x8128d54
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x81256c8 "\302\243\302\243"\0 [UTF8 "\x{a3}\x{a3}"]
      CUR = 4
      LEN = 5
    LENGTH=2.
    CHARS: 163,163
    VALUE: '££'.

When joining $string_0 and $string_1, Perl decides to use the utf8 internal encoding, but again the application sees no difference.

    my $string_3 = "\x{263a}";
    mydump($string_3);

    SV = PV(0x811f694) at 0x8128d54
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x81256c8 "\342\230\272"\0 [UTF8 "\x{263a}"]
      CUR = 3
      LEN = 5
    LENGTH=1.
    CHARS: 9786
    Wide character in print at ./test.pl line 10.
    VALUE: '☺'.

Here we have a char > 255. Length and characters work ok, but for output Perl detects that single-byte encoding cannot output the string correctly. It switches to UTF-8 encoding with a warning.

    my $string_4 = $string_0 . $string_3;
    mydump($string_4);

    SV = PV(0x811f694) at 0x8128d54
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x81256c8 "\302\243\342\230\272"\0 [UTF8 "\x{a3}\x{263a}"]
      CUR = 5
      LEN = 6
    LENGTH=2.
    CHARS: 163,9786
    Wide character in print at ./test.pl line 10.
    VALUE: '£☺'.

Again, the >255 char forces a switch to UTF-8 encoding, and a warning.

As far as I can tell, Perl 5.8 does the right thing, and people generally should not have too many problems under it.

But see what Perl 5.6.1 is doing:

    my $string_0 = "\xa3";
    mydump($string_0);

    SV = PV(0x80f6408) at 0x8100300
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x81018e0 "\243"\0
      CUR = 1
      LEN = 2
    LENGTH=1.
    CHARS: 163
    VALUE: '£'.

    my $string_1 = pack("U0a*","\302\243"); # Force UTF8 internal representation.
    mydump($string_1);

    SV = PV(0x80f6408) at 0x8100300
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x81018e0 "\302\243"\0
      CUR = 2
      LEN = 3
    LENGTH=1.
    CHARS: 163
    VALUE: '£'.

Perl 5.6.1 gives _different_ output for the _same_ string, depending on the internal representation. This is bad, which is probably the reason why the 5.6.1 docs warn that utf8 support is not stable in that version.

I guess the issue is simply that XML::Parser (and derivatives) should not have utf8 features enabled by default in Perl 5.6.1, since utf8 support isn't stable until 5.8. Likewise, I think DBI and drivers should not by default enable utf8 support in Perl < 5.8 (but please give an option to enable it for those of us who need to use UTF-8 in Perl 5.6.1, and know what we are doing).

 - Kristian.
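
P.S. For anyone who wants to check the "same string, different internal representation" point without reading Devel::Peek dumps, here is a minimal sketch. It assumes Perl 5.8.1 or later (as far as I know, utf8::is_utf8 is not available before that), and the describe() helper is my own hypothetical name, not part of any module. It uses utf8::upgrade instead of the pack("U0a*", ...) trick to force the utf8 representation.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): collect the same facts that
# mydump prints, but return them so they can be compared.
sub describe {
    my ($s) = @_;
    return {
        utf8_flag => utf8::is_utf8($s) ? 1 : 0,        # internal representation
        length    => length($s),                       # length in characters
        chars     => [ map { ord($_) } split //, $s ], # character codes
    };
}

my $bytes = "\xa3";      # single-byte internal representation
my $wide  = "\xa3";
utf8::upgrade($wide);    # same character, utf8 internal representation

my $d_bytes = describe($bytes);
my $d_wide  = describe($wide);

print "utf8 flag: $d_bytes->{utf8_flag} vs $d_wide->{utf8_flag}\n";
print "length:    $d_bytes->{length} vs $d_wide->{length}\n";
print "chars:     @{ $d_bytes->{chars} } vs @{ $d_wide->{chars} }\n";
print "equal:     ", ($bytes eq $wide ? "yes" : "no"), "\n";
```

On a Perl that behaves like the 5.8 dumps above, only the utf8 flag should differ between the two strings; length, character codes, and the eq comparison should agree.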