Re: How to handle character set in perl-ldap?

Chris Ridd Thu, 14 Aug 2003 12:54:48 -0700

On 7/8/03 9:28 am, Dan Oscarsson <[EMAIL PROTECTED]> wrote:

> Chris wrote:
>> This is a bad idea IMHO.
>> 
>> Firstly, not everyone uses ISO 8859-1 as their local character set, so you'd
>> have to make this switchable.
> 
> In the general case it should be between UTF-8 and character set of locale.


Maybe, but that might stop your scripts from working when they are moved to a
machine with a different locale. This isn't a big deal, it is a configurable
option.

> character set instead. Though there are some LDAPv2 servers that use
> other more non-standard character sets (T.61 was the X.500 standard).

There certainly are, even though T.61 was the LDAPv2 standard as well :-)

> The LDAP standard is unfortunately bad in this area. It should have used
> OCTET STRING for binary data and UTF-8 STRING for text data.

Perhaps.

> And when LDAPv3 was introduced they did not even require ;binary on
> binary data (it ought to be jpegPhoto;binary). So it is a problem.

I'm happy with the binary attribute type description, but it doesn't mean
what many people think it means. It doesn't mean the value's binary, for one
thing :-))

> The only good way I can see is to define that all DN, RDN, password and
> all attributes in a list (I might have forgotten something) to be translated.
> The attribute list could actually list the binary attributes as most are text.

The only way you can do it is to use the schema, and then build in knowledge
about what each syntax requires. You'd have to make that extensible, because
new syntaxes are defined in some environments and markets.
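
As a rough sketch of what I mean by extensible per-syntax knowledge: the
table and function names below are made up for illustration and are not
part of Net::LDAP. The OIDs are the standard LDAPv3 syntax OIDs, and I'm
assuming the attribute's syntax OID has already been looked up in the
server's schema by some other means.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical table: syntax OID => how to turn a Perl value into wire octets.
my %encoder_for_syntax = (
    '1.3.6.1.4.1.1466.115.121.1.15' => \&to_utf8,   # Directory String
    '1.3.6.1.4.1.1466.115.121.1.12' => \&to_utf8,   # DN
    '1.3.6.1.4.1.1466.115.121.1.28' => \&raw,       # JPEG
    '1.3.6.1.4.1.1466.115.121.1.40' => \&raw,       # Octet String
);

# Text syntaxes: encode Perl characters as UTF-8 octets.
sub to_utf8 {
    my ($value) = @_;
    return encode('UTF-8', $value);
}

# Binary syntaxes: pass the octets through untouched.
sub raw {
    my ($value) = @_;
    return $value;
}

# Extension point for syntaxes defined in a particular environment.
sub register_syntax {
    my ($oid, $code) = @_;
    $encoder_for_syntax{$oid} = $code;
}

# Syntaxes the table doesn't know about are sent raw.
sub encode_value {
    my ($syntax_oid, $value) = @_;
    my $code = $encoder_for_syntax{$syntax_oid} || \&raw;
    return $code->($value);
}
```

A site with its own syntaxes would call register_syntax() with its own OID
and a code reference before adding or modifying entries.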

>> If you go for a half-way solution, like anything in LDAPv3 defined as
>> LDAPString (like DNs, attribute types) being handled as UTF-8 and everything
>> else being raw octets, it will get very confusing to the calling script.
>> What should be in UTF-8, what shouldn't be?? Should 'cn=Chris Ridd' as a
>> value of seeAlso be encoded in UTF-8 or sent raw? In this case it doesn't
>> make a difference, but you can see how it might.
> 
> I cannot see that it would be confusing. Everything that is a string is
> passed between user code and LDAP module as strings are normally coded

I meant if you had to encode attribute values (OCTET STRINGs) yourself,
while the API magically treated LDAPString values as UTF-8, then you'd need:

    $ldap->add ( 'cn=Chris Ridd',
                 attrs => [
                   objectClass => [qw(top person extensibleObject)],
                   cn => String2utf8('Chris Ridd'),
                   sn => String2utf8('Ridd'),
                   myAttr => $foo,
                   seeAlso => String2utf8('cn=Chris Ridd')
                 ]
               );

You would have to wrap the seeAlso value in something (eg String2utf8) to
translate the string into UTF-8 before sending to the directory, while you
would *not* have to translate the entry's DN into UTF-8 as you're proposing
that the API does that.

*That's* confusing - they're both DNs, and worse than now where the caller
has to know what to do.
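
For reference, a String2utf8 helper of the kind used above could be
sketched with the Encode module. The function name comes from Dan's
example; the body is my assumption, and it assumes the script's strings
are ISO 8859-1 bytes, which is only one possible local character set.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the caller's strings are ISO 8859-1 encoded bytes.
my $local_charset = 'ISO-8859-1';

# Convert a string in the local character set to UTF-8 octets.
sub String2utf8 {
    my ($local_bytes) = @_;
    my $chars = decode($local_charset, $local_bytes);
    return encode('UTF-8', $chars);
}

# Under the half-way proposal the entry DN would be passed as-is,
# while the same DN used as an attribute value would still need
# converting by hand.
my $dn      = "cn=x\xe5x,o=example";
my $seeAlso = String2utf8("cn=x\xe5x,o=example");
```

The two variables hold the same DN but different bytes, which is the
asymmetry the calling script would have to keep track of.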

> (using the character set of all other strings). Everything that is
> binary data is returned as binary data (in perl strings as the interface
> does not have a separate data type for binary data).
> 
> As it is now it is very error prone and confusing. For every call
> I make I have to remember to call translate-to-utf-8(string) on every
> parameter that is a string. For example:
> $ldap->bind(String2utf8("cn=xåx,o=example"),
>             password=>String2utf8("myåpasswd"));

Password's an interesting example because it is defined to be OCTET STRING
and the character set used in it is defined to be a "local matter".


> $ldap->add(String2utf8("cn=xö,o=example"),
>            attrs => [ sn => String2utf8('xää') ]);
> 
> If I forget one String2utf8 above everything ends up wrong.
> 
> In my directory I have non-ASCII everywhere. It is very messy to have
> to translate to utf-8 (or T.61 in LDAPv2) everywhere.

I agree.

> An API should use the local character set as default and internally convert
> to protocol character set. It could be an option to enable exposing
> protocol character set.

That isn't the case with other APIs. The C libldap API just sends the raw
bytes to and fro. The various Java APIs use Unicode strings, because that's
what's native to Java anyway.

There's certainly scope for doing all this in a schema-aware version of the
Net::LDAP API, but I think the way Net::LDAP currently works without messing
around with the data is also very useful.

>  Dan
> 
> 

Cheers,

Chris
