I looked at the Unicode plugin and I believe it will most likely break the
integration with our LDAP backend, for example when searching for names
containing characters like æøå. (OpenLDAP requires its input as UTF-8.)
In addition, this is bad if your code (or templates) contains special Unicode
characters, which then become double-encoded.
The Unicode plugin looks like it could be useful if you are migrating old
data or an old website that didn't use UTF-8 before. It is definitely not the
solution for me, as it means more data processing and might introduce new
bugs.
As I said in my first post, the solution (which works for me) was to turn off
the Decode parameter. This makes more sense to me now, since my mo/po-files
are already in UTF-8 and don't need to be converted.
Right, I think there is some confusion on your part as to what is the
proper way of handling unicode in perl.
(The basic problem is that "perl's magic internal representation" just
happens to look exactly like UTF-8 plus a magic flag. Longer
description below)
First off, you need to understand the difference between characters
and bytes/octets:
"æøå" is a character string
"\303\246\303\270\303\245" is a utf8 byte sequence != a string
"\303\246\303\270\303\245" + UTF8 flag = "æøå" perl string
From perldoc perlunicode:
    ... What the "UTF8" flag means is that the sequence of octets in the
    representation of the scalar is the sequence of UTF-8 encoded code
    points of the characters of a string. The "UTF8" flag being off means
    that each octet in this representation encodes a single character
    with code point 0..255 within the string. Perl's Unicode model is not
    to use UTF-8 until it is absolutely necessary.
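To see the character/octet difference from a script rather than the debugger, here is a minimal sketch using the core Encode module. The variable names are only for illustration, and the octets are written with \x escapes so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The six UTF-8 octets for "æøå" (the same as "\303\246\303\270\303\245").
my $bytes = "\xC3\xA6\xC3\xB8\xC3\xA5";

# decode() turns octets into characters; encode() goes the other way.
my $chars = decode('UTF-8', $bytes);
my $again = encode('UTF-8', $chars);

printf "bytes: length=%d utf8_flag=%s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length=%d utf8_flag=%s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
printf "round trip matches: %s\n", $again eq $bytes ? 'yes' : 'no';
```

The byte string reports a length of 6 and the decoded string a length of 3, which is the whole difference in one number.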
The problem is that you can have two strings of data that look the same
when you print them. Let's take the example you gave of "æøå".
If this data comes from a source that doesn't set the UTF8 flag, the
SV (scalar value - where perl internals store scalars) will have the
characters of
"\303\246\303\270\303\245"
However, since none of these code points are above 255 (they can't be, as
each character = one byte), perl thinks this isn't a utf8 string.
Devel::Peek is a good module for this:
DB<3> x $foo = "\303\246\303\270\303\245"
0 'æøå'
DB<4> Dump($foo)
SV = PV(0x918d08) at 0x926848
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x5ace10 "\303\246\303\270\303\245"\0
CUR = 6
LEN = 8
It "looks right", but wait - CUR = 6. Perl thinks it's a string of 6
characters that our terminal just happens to print right. (LEN = 8 is
only the size of the allocated buffer.)
Compare that with:
DB<6> x $bar = "\x{E6}\x{F8}\x{E5}"
0 '???'
DB<7> Dump($bar)
SV = PV(0x9398dc) at 0x9306b4
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x5acbf0 "\346\370\345"\0
CUR = 3
LEN = 4
Still not quite what we want...
DB<10> Dump($baz = Encode::decode("utf8", $foo))
SV = PVMG(0x974e20) at 0x974168
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x656d30 "\303\246\303\270\303\245"\0 [UTF8 "\x{e6}\x{f8}\x{e5}"]
CUR = 6
LEN = 8
MAGIC = 0x6575e0
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 3
Right, *now* $baz is a proper unicode string that perl knows is a
string of UTF8 *characters* (note MG_LEN = 3: three characters stored
in six bytes).
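If you want to repeat the comparison outside the debugger, something like the following should do. It is only a sketch: it reuses the $foo, $bar and $baz names from the session above, and the addresses in your dump will differ from mine.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
use Encode ();

my $foo = "\303\246\303\270\303\245";     # UTF-8 octets, UTF8 flag off
my $bar = "\x{E6}\x{F8}\x{E5}";           # three code points below 256, flag off
my $baz = Encode::decode("utf8", $foo);   # decoded: three characters, flag on

# Dump() writes its report to STDERR.
Dump($_) for $foo, $bar, $baz;

# The same facts without reading the dump.
for my $pair ([foo => $foo], [bar => $bar], [baz => $baz]) {
    my ($name, $value) = @$pair;
    printf "%s: length=%d utf8_flag=%s\n",
        $name, length($value), utf8::is_utf8($value) ? 'on' : 'off';
}
```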
To relate this to your problem, you are getting some of your data
double encoded because the perl module you are using to access your
LDAP server is returning a byte sequence that perl doesn't know is
supposed to be UTF8.
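Here is a small sketch of how that double encoding comes about. The $from_ldap value is made up to stand in for whatever your LDAP module returns; I am assuming it hands back raw UTF-8 octets without the UTF8 flag.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed shape of the data from the LDAP module: UTF-8 octets, flag off.
my $from_ldap = "\xC3\xA6\xC3\xB8\xC3\xA5";

# Wrong: encoding without decoding first. Each octet is taken to be a
# character in the range 128..255 and becomes two octets on output.
my $double = encode('UTF-8', $from_ldap);

# Right: decode at the boundary, work with characters, encode on output.
my $chars  = decode('UTF-8', $from_ldap);
my $output = encode('UTF-8', $chars);

printf "double-encoded: %d octets\n", length($double);
printf "decoded first:  %d octets\n", length($output);
```

The first line reports 12 octets and the second 6. The 12-octet version is the one that shows up in a browser as "Ã¦Ã¸Ã¥".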
The answer is to do Encode::decode("utf8", $utf8_byte_sequence)
on all the data coming back from your LDAP server (or to find the
right option to make the module you are using do it).
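As a rough illustration of decoding everything at the boundary, here is a hypothetical helper. The name decode_entry, the { attribute => [values] } layout and the sample data are all my assumptions, not the API of any particular LDAP module, and it assumes every value is UTF-8 text (binary attributes such as photos would need to be skipped).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every attribute value of one entry.
# Assumes $entry looks like { attribute => [ values ] }.
sub decode_entry {
    my ($entry) = @_;
    my %decoded;
    for my $attr (keys %$entry) {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps the original octets untouched.
        $decoded{$attr} = [
            map { decode('UTF-8', $_, Encode::FB_CROAK() | Encode::LEAVE_SRC()) }
                @{ $entry->{$attr} }
        ];
    }
    return \%decoded;
}

# Made-up sample entry: "Jørgen Ås" as UTF-8 octets.
my $raw = {
    cn   => ["J\xC3\xB8rgen \xC3\x85s"],
    mail => ['jorgen@example.org'],
};
my $entry = decode_entry($raw);

printf "cn: %d octets before, %d characters after\n",
    length($raw->{cn}[0]), length($entry->{cn}[0]);
```

Doing this in one place, right where the data enters your application, means the rest of the code only ever sees character strings.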
Any of this make any sense?
PS. It seems that even Apple has problems with UTF8. In writing this
email I saved it in my drafts folder. When I came back to edit it
again, the non-ascii characters got fluffed up. Fun eh?