I looked at the Unicode plugin and I believe it will most likely break the
integration with our LDAP backend, for example when searching for names containing characters like æøå. (OpenLDAP requires its input as UTF-8.)

In addition, this is bad if your code (or templates) contains special Unicode
characters, which then become double-encoded.


The Unicode plugin looks like it could be useful if you are migrating old data or
an old website that didn't use UTF-8 before. It is definitely not the
solution for me, as it means more data processing and might introduce new
bugs.


As I said in my first post, the solution (which works for me) was to turn off the Decode parameter. This makes more sense to me now, since my mo/po files
are already in UTF-8 and don't need to be converted.

Right, I think there is some confusion on your part as to what is the proper way of handling unicode in perl.

(The basic problem is that "perl's magic internal representation" just happens to look exactly like UTF-8 plus a magic flag. Longer description below)


First off, you need to understand the difference between characters and bytes/octets.

"æøå" is a character string
"\303\246\303\270\303\245" is a utf8 byte sequence != a string

"\303\246\303\270\303\245" + UTF8 flag = "æøå" perl string
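
To make the character/octet distinction concrete, here is a minimal sketch using the core Encode module. The variable names are mine, chosen for illustration only; the point is just that the same data has a different length depending on whether Perl has been told it is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six octets: the UTF-8 encoding of "æøå". Perl sees six characters here.
my $octets = "\303\246\303\270\303\245";

# Decoding turns the octets into a three-character string.
my $chars = decode('UTF-8', $octets);

printf "octets: length %d\n", length $octets;
printf "chars:  length %d\n", length $chars;

# Encoding goes the other way and gives back the original octets.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq $octets;
```

So length() is a quick sanity check: if a name you know has three letters reports six, you are most likely holding undecoded octets.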

From perldoc perlunicode

... What the "UTF8" flag means is that the sequence of octets in the representation of the scalar is the sequence of UTF-8 encoded code points of the characters of a string. The "UTF8" flag being off means that each octet in this representation encodes a single character with code point 0..255 within the string. Perl's Unicode model is not to use UTF-8 until it is absolutely necessary.

The problem is that you can have two strings of data that look the same when you print them. Let's take the example you gave of "æøå". If this data comes from a source that doesn't set the UTF8 flag, the SV (scalar value - where perl internals store scalars) will have the characters of

  "\303\246\303\270\303\245"

However, since none of these code points are above 255 (they can't be, as each character = one byte), perl thinks this isn't a utf8 string. Devel::Peek is a good module for this:

  DB<3> x $foo = "\303\246\303\270\303\245"
0  'æøå'
  DB<4> Dump($foo)
SV = PV(0x918d08) at 0x926848
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x5ace10 "\303\246\303\270\303\245"\0
  CUR = 6
  LEN = 8

It "looks right", but wait - CUR = 6. Perl thinks it's a string of 6 characters that our terminal just happens to print right. (LEN = 8 is only the size of the allocated buffer.)

Compare that with:

  DB<6> x $bar = "\x{E6}\x{F8}\x{E5}"
0  '???'
  DB<7> Dump($bar)
SV = PV(0x9398dc) at 0x9306b4
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x5acbf0 "\346\370\345"\0
  CUR = 3
  LEN = 4

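Still not quite what we want: perl sees three characters here (CUR = 3), but they are stored as single Latin-1 bytes without the UTF8 flag, so they come out as '???' on a UTF-8 terminal.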

  DB<10> Dump($baz = Encode::decode("utf8", $foo))
SV = PVMG(0x974e20) at 0x974168
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x656d30 "\303\246\303\270\303\245"\0 [UTF8 "\x{e6}\x{f8}\x{e5}"]
  CUR = 6
  LEN = 8
  MAGIC = 0x6575e0
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 3


Right, *now* $baz is a proper unicode string that perl knows is a string of UTF8 *characters*.
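
If you would rather check this from a script than from the debugger, something along these lines should do. It is only a sketch: the variable names mirror the debugger session above, and utf8::is_utf8 is used purely as a diagnostic of the internal flag, not as something application code should branch on.

```perl
use strict;
use warnings;
use Encode ();

my $foo = "\303\246\303\270\303\245";       # UTF-8 octets, flag off
my $bar = "\x{E6}\x{F8}\x{E5}";             # three Latin-1 code points, flag off
my $baz = Encode::decode('UTF-8', $foo);    # decoded characters, flag on

# Report the logical length and the state of the internal UTF8 flag.
for my $pair ([foo => $foo], [bar => $bar], [baz => $baz]) {
    my ($name, $value) = @$pair;
    printf "%s: length=%d utf8_flag=%s\n",
        $name, length $value, utf8::is_utf8($value) ? 'on' : 'off';
}

# eq compares characters, not internal representations.
my $bar_matches_baz = $bar eq $baz ? 1 : 0;
my $foo_matches_baz = $foo eq $baz ? 1 : 0;
print "bar eq baz: $bar_matches_baz, foo eq baz: $foo_matches_baz\n";
```

Note that $bar and $baz compare equal as strings even though their internal storage differs, while $foo (the undecoded octets) does not match either of them.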



To relate this to your problem: some of your data ends up double-encoded because the perl module you are using to access your LDAP server is returning a byte sequence that perl doesn't know is supposed to be UTF8.
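
Here is a small sketch of how the double encoding comes about. The value is a made-up stand-in for what a backend might hand you (I don't know which module you use), and the explicit encode() call stands in for whatever layer encodes your output.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical value from a backend: the UTF-8 octets for "æ", never decoded.
my $from_backend = "\303\246";

# Encoding for output treats each octet as a Latin-1 character,
# so the two octets of "æ" are each encoded again.
my $double = encode('UTF-8', $from_backend);

# Print the resulting octets as decimal numbers.
printf "%vd\n", $double;   # 195.131.194.166
```

Four octets come out where two went in, which is what shows up as "Ã¦" style garbage in the browser.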

The answer is to do Encode::decode("utf8", $utf8_byte_sequence)
on all the data coming back from your LDAP server (or to find the right option to make the module you are using do it).
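
As a rough sketch of what I mean by decoding at the boundary: decode_attributes below is a hypothetical helper, and %raw is a stand-in for whatever your LDAP module returns (I am assuming a flat hash of attribute name to value, which your module may not give you). The last line covers the other direction, since the server wants UTF-8 octets on input.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode every value of a flat hash of attributes.
sub decode_attributes {
    my ($raw) = @_;
    return { map { $_ => decode('UTF-8', $raw->{$_}) } keys %$raw };
}

# Stand-in for data from the LDAP module: UTF-8 octets, not yet decoded.
my %raw = (cn => "J\303\270rgen", sn => "\303\205s");

my $entry = decode_attributes(\%raw);
printf "cn has %d characters\n", length $entry->{cn};

# Going the other way, encode character strings before sending them out.
my $filter = encode('UTF-8', "(cn=$entry->{cn})");
```

Inside the application everything is then character strings, and octets only exist at the edges.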

Any of this make any sense?


PS. It seems that even Apple has problems with UTF8. In writing this email I saved it in my drafts folder. When I came back to edit it again, the non-ascii characters got fluffed up. Fun eh?


