Re: How to handle character set in perl-ldap?

Peter Marschall Thu, 14 Aug 2003 12:21:03 -0700

Hi,

On Monday 11 August 2003 13:27, Dan Oscarsson wrote:
> >At a first glance the second case seems easier for the application
> >programmer, but it is really broken:
> >Consider the following cases:
> >1) A German and a Czech shall be added to the directory
> >    during the same connection.
> >    Each one might have attributes that need to be
> >    converted from the resp. char set (Latin1 vs Latin2)
> >2) LDAP allows different representations for the same attribute
> >    (using the lang-.. qualifiers). Now, try to add a Chinese
> >    name and its transcription to Latin1 (or Latin2 ....)
> >    in one LDAP session.
>
> All attributes use UTF-8 in LDAP, even with different lang- qualifiers.
> If you have a program with one connection internally in the program
> you do not have German and Czech strings using different character sets.
> So neither 1) nor 2) is any problem.


You are right. With the current API it isn't a problem.

But IIRC this thread started with your complaint that perl-ldap does no
automatic conversion (controlled by parameters to Net::LDAP->new)
of string attributes from the local character set to UTF-8.
If I got that right, that means that you do not convert from your "local
character set" to UTF-8 in your application but let your version of
perl-ldap do it.
That in turn means that your version of perl-ldap expects strings in your
local character set and interprets every string it gets in that character set.
Now if you have a string in a different character set, what do you do?
That's what my example with the German and the Czech, with the condition
"during the same connection", is about.

> >Both cases are absolutely legal and possible with the current API.
> >With an API that is not absolutely transparent, they will not work.
>
> What is the problem? All attributes in LDAPv3 use UTF-8 encoding.

The problem is that your version of perl-ldap cannot be fed strings that
cannot be represented in the "local" character set. It also cannot correctly
interpret strings in character sets different from your local character set.
E.g. if you have perl-ldap set to convert strings from Latin1 to UTF-8, it
will interpret 0xD2 as "LATIN CAPITAL LETTER O WITH GRAVE".
Unfortunately the string containing 0xD2 was a Latin2 string, where
0xD2 is "LATIN CAPITAL LETTER N WITH CARON".
Oops, information lost!
And what about characters that have no single-byte representation: Chinese,
Japanese or mathematical symbols like U+2226 "NOT PARALLEL TO"?
When doing conversion from Latin1, those cannot even be fed into
the API.
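
To make the 0xD2 case concrete, here is a minimal sketch. It is written in
Python purely for illustration; nothing in it is perl-ldap API, and the helper
name fits_in is made up for this example. It decodes the same byte under both
character sets and then checks whether a Chinese character can be expressed in
Latin1 at all.

```python
import unicodedata

# One byte, with no record of which character set produced it.
raw = b"\xd2"

# The same byte decoded under the two assumptions discussed above.
as_latin1 = raw.decode("iso-8859-1")
as_latin2 = raw.decode("iso-8859-2")

print(unicodedata.name(as_latin1))  # LATIN CAPITAL LETTER O WITH GRAVE
print(unicodedata.name(as_latin2))  # LATIN CAPITAL LETTER N WITH CARON

# The two readings end up as different UTF-8 byte sequences, so a
# wrong guess about the source charset silently stores the wrong character.
utf8_from_latin1 = as_latin1.encode("utf-8")
utf8_from_latin2 = as_latin2.encode("utf-8")


def fits_in(text, charset):
    """Hypothetical helper: can `text` be encoded in `charset` at all?"""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


chinese = "\u4e2d"  # a CJK ideograph, as a stand-in for a Chinese name
print(fits_in(chinese, "iso-8859-1"))  # False
print(fits_in(chinese, "utf-8"))       # True
```

A converting layer fixed to Latin1 has no way to tell the two readings of
0xD2 apart, and it has no input at all that would yield the Chinese character.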

> >But then we have the same problem: How can I enter a character
> >of a character set different from my default input character set ?
>
> Inside an application I prefer to use ONE character set encoding as
> working with character data gets so much easier then.

Accepted. But this character set must be capable of representing all
legal characters for the API, or else you will lose information or forbid
some use cases.
That means using Latin1 (or any other 8-bit character set)
will restrict your possible use cases (see my examples).

If you have to take Unicode anyway, you can now choose between
the various representations. So, why not take UTF-8 if it is the
optimal encoding for perl-ldap (since it does not need any mapping)?

> So it is during input/output the translation takes place.
> If my program uses a protocol, the translation from internal
> character encoding to protocol encoding will take place when data
> is to be transferred to the protocol. ...

Why not do it right: do the conversion when reading from byte-oriented
input / writing to byte-oriented output such as files and terminals,
and work with UTF-8 inside your application?
Once the data has crossed the border from the outside (file, terminal)
you are safe, since you are in UTF-8.
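
A rough sketch of that boundary conversion, again in Python only for
illustration. The function names read_text and to_wire and the sample names
are invented for this example, and the in-memory streams stand in for files
or terminals. Each input is decoded with the character set of its own source,
so a Latin1 and a Latin2 string can coexist in one program and both leave it
as UTF-8.

```python
import io


def read_text(stream, source_charset):
    """Decode bytes from a byte-oriented source right at the boundary."""
    return stream.read().decode(source_charset)


def to_wire(text):
    """Encode internal text as UTF-8 before handing it to the protocol."""
    return text.encode("utf-8")


# Two byte-oriented inputs, each in a different 8-bit character set.
german_source = io.BytesIO(b"M\xfcller")  # Latin1 bytes
czech_source = io.BytesIO(b"N\xecmec")    # Latin2 bytes

# The charset is supplied per source, not once per connection.
german = read_text(german_source, "iso-8859-1")
czech = read_text(czech_source, "iso-8859-2")

# Inside the application both are ordinary Unicode strings.
for name in (german, czech):
    print(name, to_wire(name))
```

The point of the design is that the character set is a property of each input
source, so it is named where the bytes enter the program rather than being
fixed once for the whole LDAP connection.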

> ...If possible I want the fact that
> the protocol uses another format to be invisible. It should be hidden
> under the APIs. Sometimes you want to access the protocol encoding
> directly, so an API should allow it to be exposed.
> Most people will prefer not to have to think about it.

This is a kludge. You are fattening the API, making it more error-prone,
more obscure for more complicated use cases, and slower, and you are
restricting use cases in general, just to stay in 8 bit instead of doing it
correctly with UTF-8.

But wait: AFAIK Net::LDAP is subclassable. You do not need to change Net::LDAP
but can make your private subclass that does the conversion if you want
it.

Peter

-- 
Peter Marschall
eMail: [EMAIL PROTECTED]
