Re: How to handle character set in perl-ldap?

Peter Marschall Mon, 11 Aug 2003 06:46:37 -0700

Hi,

I fully concur with Chris' opinion about APIs:
They have to be transparent.
Character set conversion should happen at the input side of the application,
not at the interface between application and API.
The API cannot guess what the application programmer wants.

At first glance the second case seems easier for the application
programmer, but it is really broken:
Consider the following cases:
1) A German and a Czech shall be added to the directory
    during the same connection.
    Each one might have attributes that need to be
    converted from the respective character set (Latin1 vs. Latin2).
2) LDAP allows different representations for the same attribute
    (using the lang-.. qualifiers). Now, try to add a Chinese
    name and its transcription to Latin1 (or Latin2 ....)
    in one LDAP session.

Both cases are absolutely legal and possible with the current API.
With an API that is not absolutely transparent, they will not work.
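
To make case 1 concrete, here is a minimal sketch in Python (chosen only
for illustration; it is not perl-ldap code). The sample names and the
helper to_utf8() are hypothetical. It shows the conversion being done at
the input side, where the character set of each value is still known:

```python
# Hypothetical input: raw bytes as they arrived, each in its own charset.
german_raw = b"J\xfcrgen M\xfcller"      # Latin1 (ISO-8859-1) bytes
czech_raw = b"Ji\xf8\xed \xc8ern\xfd"    # Latin2 (ISO-8859-2) bytes


def to_utf8(raw: bytes, charset: str) -> bytes:
    """Convert raw bytes from the given charset to UTF-8 (hypothetical helper)."""
    return raw.decode(charset).encode("utf-8")


# The application knows the charset of each value and converts per value.
german_utf8 = to_utf8(german_raw, "iso-8859-1")
czech_utf8 = to_utf8(czech_raw, "iso-8859-2")

# A single connection-wide "local" charset would garble one of the two:
czech_garbled = to_utf8(czech_raw, "iso-8859-1")

print(german_utf8.decode("utf-8"))
print(czech_utf8.decode("utf-8"))
print(czech_garbled.decode("utf-8"))
```

Both UTF-8 values can then be handed to a transparent API in the same
session, which is exactly what a per-connection charset option cannot do.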

On Thursday 07 August 2003 11:49, Dan Oscarsson wrote:
> >The only way you can do it is use the schema, and then build in knowledge
> >about what each syntax requires. You'd have to make that extensible,
> > because new syntaxes are defined in some environments and markets.
> Or let it be done as an option when doing new Net::LDAP

Neither the schema solution nor the solution with additional parameters to the 
constructor will help. See my examples above.

> In my proposal all strings will be translated (except those representing
> binary values).
> So I would not wrap any data with "String2utf8".
> So it would not be confusing at all.
This will not help. See my examples above.

> Yes I know password is a difficult thing. I have many problems with that
> in my mixed Unix/MS Windows/Mac environment.
> But the only way to get it to work, is to use the same character set
> for all passwords in a database. ...
What do you do in an international company with different character sets ?
Restrict all passwords to plain ASCII ?
Not the best idea, IMHO.

> ... As LDAPv3 says UTF-8 for strings and
> passwords entered by humans normally are strings, I would expect the
> normal case to be UTF-8 encoded passwords.
Who does the encoding from the local character set to UTF-8 ?
Most operating systems and applications that I know use 8-bit characters.
If they all used Unicode, then converting to UTF-8 would be a no-brainer
because it is 1:1 but currently the 

>
> >> A API should use local character set as default and internally convert
> >> to protocol character set. It could be an option to enable exposing
> >> protocol character set.
> >
> >That isn't the case with other APIs. The C libldap API just sends the raw
> >bytes to and fro. The various Java APIs use Unicode strings, because
> > that's what's native to Java anyway.
>
> I know that many API implementors do not think about this. I am very tired
> of having to do different translations between system character set and
> protocol character sets. Protocol character sets and formats should be
> hidden from the programmer.

Sorry, this is wrong. For a universal API there can be no such thing as
a "local" character set. An API has to support all use cases, and thus
may not impose the restriction that all values must share the same
character set.
The character set of a piece of data does not depend on anything
other than the data itself. So the notion of having a common
"local character set" is broken.

Consider the character "LATIN CAPITAL LETTER N WITH CARON"
which is represented in Latin2 as 0xD2.
So "LATIN CAPITAL LETTER N WITH CARON" is equivalent to
the tuple (Latin2, 0xD2) and this is true no matter whether my "local
character set" (to be exact: the default input character set of my system)
is Latin1 or ISO-8859-15 or anything else.
Looking at the byte 0xD2 alone is not sufficient: For example, in
Latin1 0xD2 is "LATIN CAPITAL LETTER O WITH GRAVE".
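
The (character set, byte) tuple can be checked with a few lines of
Python (again only an illustration, not perl-ldap code; the helper
describe() is a hypothetical name). It decodes the same byte under both
character sets and looks up the Unicode character name:

```python
import unicodedata


def describe(raw: bytes, charset: str) -> str:
    """Return the Unicode name of a single byte interpreted in charset."""
    return unicodedata.name(raw.decode(charset))


byte = b"\xd2"
print(describe(byte, "iso-8859-2"))  # LATIN CAPITAL LETTER N WITH CARON
print(describe(byte, "iso-8859-1"))  # LATIN CAPITAL LETTER O WITH GRAVE
```

The byte is identical in both calls; only the character set passed
alongside it decides which character is meant.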

> In Java it works as it should, system character set is UTF-16 and
> the Java APIs do the translation to the protocol character set.
> Here you do not have to think about character set issues.
> In the Java LDAP API (JNDI) it has a list of attributs that are known
> to be non-string (and you can add to that list). Those attributes will
> not be translated.

To be correct: the hard work of character set conversion needs to happen when
entering data into a Java application, i.e. the Java application has to know
about my "default input character set" and interpret the data accordingly.
I doubt that the Java VM reprograms my keyboard to send UTF-16 ;-)
But then we have the same problem: How can I enter a character
of a character set different from my default input character set ?

The mapping from UTF-16 to UTF-8 is a 1:1 mapping and not a character set
conversion (both are encodings of the same Unicode code points).
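
A short Python sketch of that point (illustration only; the sample
characters are my own choice): going from UTF-16 to UTF-8 and back
needs no knowledge about any "local" character set and loses nothing,
because both sides encode the same code points.

```python
# N WITH CARON (U+0147) and a Chinese character (U+4E2D) as sample text.
text = "\u0147\u4e2d"

utf16 = text.encode("utf-16-be")                  # what Java holds internally
utf8 = utf16.decode("utf-16-be").encode("utf-8")  # what LDAPv3 puts on the wire
round_trip = utf8.decode("utf-8").encode("utf-16-be")

print(utf16.hex())   # 01474e2d
print(utf8.hex())    # c587e4b8ad
print(round_trip == utf16)
```

The hard part, deciding that an incoming 0xD2 means U+0147 rather than
U+00D2, has already happened before the first line of this sketch.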

Peter
-- 
Peter Marschall
eMail: [EMAIL PROTECTED]
