On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:
> So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is
> that crap?

That's octal notation, which I think Dump() uses for any byte greater than
127 and for control characters, so that it can output pure ASCII. That
sequence is only four bytes:

mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; Encode::_utf8_on($s); Dump $s'
SV = PV(0x801038) at 0x80e880
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
  CUR = 4  <----------------------------------------------- four bytes
  LEN = 8
mar...@smokey:~ $

The logical content of the string follows in the second quote:

> [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]

That's valid UTF-8.

> my $str = '<p>Tomas Laurinavi????ius</p>';

In source code, I try to stick to pure ASCII and use \x escapes -- like
Dump() does.

    my $str = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"

However, because those code points are both representable as Latin-1, Perl
will create a Latin-1 string. If you want to force its internal encoding to
UTF-8, you need to do additional work.

mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; utf8::upgrade($s); Dump $s'
SV = PV(0x801038) at 0x80e870
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x2012e0 "\304"\0
  CUR = 1
  LEN = 4
SV = PV(0x801038) at 0x80e870
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
  CUR = 2
  LEN = 3
mar...@smokey:~ $

> Confused and frustrated,

IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand. Perl tries to
hide the details, but there are too many ways for it to fail silently.
("perl -C", $YAML::Syck::ImplicitUnicode, etc.)

Marvin Humphrey