Re: How to handle character set in perl-ldap?

Peter Marschall Sat, 16 Aug 2003 08:28:16 -0700

Hi Graham,

On Wednesday 13 August 2003 22:01, Graham Barr wrote:
> On Wed, 2003-08-13 at 20:50, Kurt D. Zeilenga wrote:
> > I don't think the LDAP API should do any transcoding, the API should
> > be a simple conduit between the application and the wire.  Any
> > transcoding desires should be done by the application (by directly
> > calling APIs specifically designed to do transcoding).
>
> I have been following this thread a bit. And while I would agree that
> the API (Net::LDAP in this case) should not do transcoding, I don't see
> any reason why it cannot provide hooks to make the application
> developer's life easier.


I remember a thread this spring about perl-ldap, Convert::ASN1 and Unicode
support. There were some issues with character semantics.
Are they still there, or is it possible to feed strings with character
semantics into perl-ldap and get strings in character semantics back?

Oops, wanting them back in character semantics is dangerous (as I wrote in
various previous mails in this thread ;-)), because it needs knowledge of the
data (schema, ...). But maybe an option to get_value() can help here.

The idea is to take a string such as "Hägar" (Latin1) in character semantics,
i.e. "HÃ¤gar" when encoded as UTF8 (still only 5 characters long under
character semantics: the "Ã¤" is 2 bytes, but only one character), and to
accept this regular Perl string as input to operations in perl-ldap.
Of course even a string like "Any \x{0021} string \N{SMILEY FACE}" should
work  ;-)

I know it works with byte semantics where "HÃ¤gar" is a string of length() six 
and Perl has no idea that the "Ã" and the "¤" are actually the UTF8 encoding 
for the Latin1 "ä".
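To illustrate the difference, here is a minimal sketch using the Encode module (the variable names are only for illustration). It should report 5 characters for the character-semantics string and 6 octets for its UTF8 encoding:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "H\x{e4}gar" is "Hägar" as a character string (5 characters).
my $chars = "H\x{e4}gar";

# Its UTF-8 encoding is a byte string of 6 octets.
my $octets = encode("UTF-8", $chars);

printf "characters: %d\n", length($chars);
printf "octets:     %d\n", length($octets);

# Decoding the octets gives back the 5-character string.
my $back = decode("UTF-8", $octets);
print "round trip ok\n" if $back eq $chars;
```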

I also know the character semantics will not work with versions < 5.8 because 
there the Unicode support was not so complete (IIRC the utf8 flag was 
lexically scoped and not an attribute of each variable).

To check whether a string is in byte or character semantics, and to convert
from character semantics to byte semantics where needed, the Encode module
(or the utf8 package of Perl 5.8.1) should do the job,
e.g. with Encode:
  use Encode qw(encode is_utf8);
  # convert to byte semantics if string is in character semantics,
  # otherwise pass the string through unchanged
  $octets = is_utf8($string) ? encode("utf8", $string) : $string;

When reading an attribute's value, an additional option [e.g. chars => 1]
can tell get_value() to use character semantics instead of the default byte 
semantics. This allows the user to get Perl strings from attributes he knows 
to be encoded in UTF8.
e.g.
  # get givenName as a string in character semantics
  $string = $entry->get_value('givenname', chars => 1);
  # get jpegPhoto as a sequence of bytes
  $octets = $entry->get_value('jpegPhoto');
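
What such an option might do internally can be sketched roughly as follows. This is only an assumption about a possible implementation, not existing perl-ldap code: values_for() and the %attrs hash are hypothetical stand-ins for get_value() and for the data held by an entry.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of perl-ldap: return attribute values
# either as raw octets (default) or decoded into character strings.
sub values_for {
    my ($attrs, $name, %opt) = @_;
    my @values = @{ $attrs->{lc $name} || [] };
    # Decode only on request; the caller must know the attribute is UTF-8.
    @values = map { decode("UTF-8", $_) } @values if $opt{chars};
    return wantarray ? @values : $values[0];
}

# Stand-in for an entry: attribute name => list of octet strings.
my %attrs = (
    givenname => [ "H\xc3\xa4gar" ],      # UTF-8 octets for "Hägar"
    jpegphoto => [ "\xff\xd8\xff\xe0" ],  # binary data, must stay octets
);

my $string = values_for(\%attrs, 'givenName', chars => 1);
my $octets = values_for(\%attrs, 'jpegPhoto');

printf "givenName: %d characters\n", length($string);
printf "jpegPhoto: %d octets\n", length($octets);
```

Keeping byte semantics as the default means binary attributes such as jpegPhoto are never touched unless the caller asks for decoding.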

Peter

PS: Having written this I notice that the conversion from character semantics 
strings to UTF8-encoded byte semantics in perl-ldap when writing might be
risky too since it assumes that this attribute is UTF8 encoded in LDAP.
Here, I think, the risk is tolerable as long as byte semantics is supported in
perl-ldap and the behaviour with character semantics is explained in the man
page.

PPS: For me this is all pure theory since I only have Perl 5.6.0 at work (and 
for compatibility's sake at home too).

-- 
Peter Marschall
eMail: [EMAIL PROTECTED]
