Re: [Catalyst] Problem with Catalyst::Plugin::I18N using UTF-8

Ash Berlin Fri, 21 Dec 2007 13:50:45 -0800

I looked at the Unicode plugin and I believe it will most likely break the
integration against our LDAP backend, for example when searching for names containing characters like æøå. (OpenLDAP requires its input as UTF-8.)
In addition, this is bad if your code (or templates) contains special unicode
characters, which then become double-encoded.
The Unicode plugin looks like it could be useful if you are migrating old data or
an old website that didn't use UTF-8 before. It is definitely not the
solution for me, as it means more data processing and might introduce new
bugs.
As I said in my first post, the solution (which works for me) was to turn off the Decode parameter. This makes more sense to me now, since my mo/po-files
are already in UTF-8 and don't need to be converted.

Right, I think there is some confusion on your part as to what the proper way of handling unicode in perl is.

(The basic problem is that "perl's magic internal representation" just happens to look exactly like UTF-8 plus a magic flag. Longer description below.)

First off, you need to understand the difference between characters and bytes/octets:


"æøå" is a character string
"\303\246\303\270\303\245" is a utf8 byte sequence != a string

"\303\246\303\270\303\245" + UTF8 flag = "æøå" perl string
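
To make the distinction concrete, here is a minimal sketch using the core Encode module. The variable names are mine, purely for illustration; the octets are the ones from the example above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six octets: the UTF-8 encoding of "æøå", written as octal escapes.
my $bytes = "\303\246\303\270\303\245";

# Decoding turns the octets into a three-character string.
my $chars = decode("utf8", $bytes);

# Encoding goes the other way, back to the same six octets.
my $round_trip = encode("utf8", $chars);

printf "bytes: %d, chars: %d, round trip matches: %s\n",
    length($bytes), length($chars), ($round_trip eq $bytes ? 'yes' : 'no');
# prints "bytes: 6, chars: 3, round trip matches: yes"
```

The same data has two lengths depending on whether perl has been told it is UTF-8: six octets before decoding, three characters after.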

From perldoc perlunicode

           ... What the "UTF8" flag means is that the sequence of
           octets in the representation of the scalar is the sequence
           of UTF-8 encoded code points of the characters of a string.
           The "UTF8" flag being off means that each octet in this
           representation encodes a single character with code point
           0..255 within the string. Perl's Unicode model is not to use
           UTF-8 until it is absolutely necessary.

The problem is that you can have two strings of data that look the same when you print them. Let's take the example you gave of "æøå". If this data comes from a source that doesn't set the UTF8 flag, the SV (scalar value - where perl internals store scalars) will have the characters of


  "\303\246\303\270\303\245"

However, since none of these code points are above 255 (they can't be, as each character = one byte) perl thinks this isn't a utf8 string. Devel::Peek is a good module for this:


  DB<3> x $foo = "\303\246\303\270\303\245"
0  'æøå'
  DB<4> Dump($foo)
SV = PV(0x918d08) at 0x926848
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x5ace10 "\303\246\303\270\303\245"\0
  CUR = 6
  LEN = 8

It "looks right", but wait - CUR = 6 (LEN is just the size of the allocated buffer). Perl thinks it's a string of 6 characters that our terminal just happens to print right.


Compare that with:

  DB<6> x $bar = "\x{E6}\x{F8}\x{E5}"
0  '???'
  DB<7> Dump($bar)
SV = PV(0x9398dc) at 0x9306b4
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x5acbf0 "\346\370\345"\0
  CUR = 3
  LEN = 4

Still not quite what we want...

  DB<10> Dump($baz = Encode::decode("utf8", $foo))
SV = PVMG(0x974e20) at 0x974168
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x656d30 "\303\246\303\270\303\245"\0 [UTF8 "\x{e6}\x{f8}\x{e5}"]
  CUR = 6
  LEN = 8
  MAGIC = 0x6575e0
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 3

Right, *now* $baz is a proper unicode string that perl knows is a string of UTF8 *characters*.
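
If you don't want to read Devel::Peek dumps, the same three cases can be compared with length() and utf8::is_utf8(). This is only a sketch that mirrors the debugger session above, reusing its variable names:

```perl
use strict;
use warnings;
use Encode ();

my $foo = "\303\246\303\270\303\245";     # UTF-8 octets, flag off
my $bar = "\x{E6}\x{F8}\x{E5}";           # three code points below 256, flag off
my $baz = Encode::decode("utf8", $foo);   # decoded character string, flag on

for my $pair ([foo => $foo], [bar => $bar], [baz => $baz]) {
    my ($name, $value) = @$pair;
    printf "%s: length=%d utf8_flag=%s\n",
        $name, length($value), (utf8::is_utf8($value) ? 'on' : 'off');
}
# prints:
#   foo: length=6 utf8_flag=off
#   bar: length=3 utf8_flag=off
#   baz: length=3 utf8_flag=on
```

Only $foo has the wrong length, because it is still a sequence of octets rather than a string of characters.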

To relate this to your problem, you are getting some of your data double encoded because the perl module you are using to access your LDAP server is returning a byte sequence that perl doesn't know is supposed to be UTF8.
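
Here is a small sketch of how that double encoding comes about. $from_ldap is just a stand-in I made up for an undecoded value; I am assuming the output side encodes everything as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for an undecoded value: the UTF-8 octets of "æøå".
my $from_ldap = "\303\246\303\270\303\245";

# Encoding the raw octets treats each one as a separate character, so
# every octet >= 0x80 turns into two octets: that is the double encoding.
my $double = encode("utf8", $from_ldap);

# Decoding on input and encoding on output gives back the original octets.
my $correct = encode("utf8", decode("utf8", $from_ldap));

printf "double-encoded: %d octets, correct: %d octets\n",
    length($double), length($correct);
# prints "double-encoded: 12 octets, correct: 6 octets"
```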


The answer is to do Encode::decode("utf8", $utf8_byte_sequence)

on all the data coming back from your LDAP server (or to find the right option to make the module you are using do it).
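
As a rough sketch of what "on all the data" could look like: decode_attributes below is a hypothetical helper of mine, and the hash of attribute name => list of values is an assumed shape, not the actual return format of any particular LDAP module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode every value in a hash of
# attribute name => array ref of UTF-8 octet strings.
sub decode_attributes {
    my ($attrs) = @_;
    my %decoded;
    for my $name (keys %$attrs) {
        $decoded{$name} =
            [ map { Encode::decode("utf8", $_) } @{ $attrs->{$name} } ];
    }
    return \%decoded;
}

# Made-up entry standing in for whatever your LDAP module hands back.
my $entry   = { cn => ["\303\246\303\270\303\245"], sn => ["Berlin"] };
my $decoded = decode_attributes($entry);

printf "cn has %d characters\n", length $decoded->{cn}[0];
# prints "cn has 3 characters"
```

Attributes that hold binary data rather than text would need to be skipped, since they are not UTF-8 to begin with.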


Any of this make any sense?

PS. It seems that even Apple has problems with UTF8. In writing this email I saved it in my drafts folder. When I came back to edit it again, the non-ascii characters got fluffed up. Fun eh?




