Peter Marschall wrote:

>But IIRC this thread started with your complaint that perl-ldap does no
>automatic conversion (controlled by parameters to Net::LDAP->new)
>of string attributes from the local character set to UTF-8.
>If I got that right, that means that you do not convert from your "local 
>character set" to UTF-8 in your application but let your version of 
>perl-ldap do it.
>That again means that your version of perl-ldap expects strings in your local 
>character set and interprets the strings it gets in that character set.
>Now if you have a string in a different character set, what do you do now?
>That's what my example with German and Czech with the condition "during the 
>same connection" is about.

I do not have strings in different character sets. While there may
exist programs that do, most do not.
The way I extended the API allows your very uncommon case too.

>The problem is that your version of perl-ldap can not be fed strings that 
>cannot be represented in the "local" character set. It also cannot interpret 
>correctly strings in character sets different from your local character set.
>E.g. if you have perl-ldap set to convert strings from Latin1 to UTF-8, it 
>will interpret 0xD2 as "LATIN CAPITAL LETTER O WITH GRAVE".
>Unfortunately the string containing 0xD2 was a Latin2 string where
>0xD2 is "LATIN CAPITAL LETTER N WITH CARON". 
>Ooops, information lost !!

But I have no strings in Latin2. I always use ONE character set inside
my applications. It makes no sense to complicate the world by having
many at the same time.
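The ambiguity Peter describes above is easy to demonstrate. A rough sketch in Python (not perl-ldap itself) showing how the single byte 0xD2 means different characters under Latin1 and Latin2, so a converter that assumes the wrong source charset silently stores the wrong character:

```python
# The same byte 0xD2 decodes to different characters depending on
# which 8-bit character set the decoder assumes.
raw = b"\xd2"

latin1 = raw.decode("iso-8859-1")   # Latin1
latin2 = raw.decode("iso-8859-2")   # Latin2

print(latin1)  # Ò  LATIN CAPITAL LETTER O WITH GRAVE (U+00D2)
print(latin2)  # Ň  LATIN CAPITAL LETTER N WITH CARON (U+0147)

# Their UTF-8 encodings differ, so converting with the wrong
# assumption loses the original information.
print(latin1.encode("utf-8"))  # b'\xc3\x92'
print(latin2.encode("utf-8"))  # b'\xc5\x87'
```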

>And what about characters that need more than one byte: Chinese,
>Japanese, or mathematical symbols like 0x2226 "NOT PARALLEL TO"?
>When doing conversion from Latin1 those cannot even be fed into
>the API.
If I want to handle those I would not use Latin1 as the local character set.
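To make the restriction concrete, a small Python sketch (illustration only, not tied to perl-ldap) showing that a character outside Latin1 cannot even be expressed as Latin1 bytes, while UTF-8 handles it fine:

```python
# A string containing a character outside Latin1 cannot be
# represented as Latin1 bytes at all, so it can never pass
# through an API that assumes Latin1 input.
s = "\u2226"  # NOT PARALLEL TO

try:
    s.encode("iso-8859-1")
except UnicodeEncodeError as e:
    print("cannot represent in Latin1:", e.reason)

# The same character is no problem for UTF-8.
print(s.encode("utf-8"))  # b'\xe2\x88\xa6'
```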


>> Inside an application I prefer to use ONE character set encoding as
>> working with character data gets so much easier then.
>
>Accepted. But this character set must be capable of representing all
>legal characters for the API or else you will lose information or forbid
>some use cases.
>That means using Latin1 (or any other 8-bit character set)
>will restrict your possible use cases (see my examples).

Yes it will. But if you look at my current use, the local character set
is Latin1, and no data in my LDAP directory will contain characters
outside Latin1.
I expect it is the same in most places in the world - you use data
that can be represented in the local character set in use.

>
>If you have to take Unicode anyway, you now can choose between
>the various representations. So, why not take UTF-8 if it is the
>optimal encoding for perl-ldap (since it does not need any mapping)?

UTF-8 is not optimal for character handling.


>Why not do it right and do the conversion when reading from byte
>oriented input / writing to byte oriented output such as files and terminals,
>and work with UTF-8 inside your application?
>Once the data crosses the border from the outside (file, terminal)
>you are safe since you are in UTF-8.

If I want to expand outside Latin1 I would use UCS-2 or UCS-4, as those
can be handled easily and efficiently.
UTF-8 can be good for storage and interoperability. It is not good to
use internally in programs handling characters (except for those
only handling ASCII characters). By handling characters I mean more than
just copying bytes.

So I would still need to convert from the transport format UTF-8 to the
internal format UCS-2 (UCS-4).
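The decode-at-the-boundary pattern looks roughly like this in Python (a sketch; Python's str happens to be a fixed-width code-point sequence internally, comparable to a UCS-2/UCS-4 buffer):

```python
# Decode once at the transport boundary, work on fixed-width code
# points internally, encode once on the way back out.
def from_transport(data: bytes) -> str:
    """UTF-8 wire bytes -> internal fixed-width string."""
    return data.decode("utf-8")

def to_transport(text: str) -> bytes:
    """Internal string -> UTF-8 wire bytes."""
    return text.encode("utf-8")

wire = b"Stra\xc3\x9fe"            # "Straße" in UTF-8
internal = from_transport(wire)
print(internal[4])                  # 'ß': O(1) character access internally
print(to_transport(internal) == wire)  # True: lossless round trip
```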


>This is a kludge. You are fattening the API, making it more error prone,
>more obscure for more complicated use cases, and slower, and you restrict use 
>cases in general just to stay in 8 bit instead of doing it correctly with UTF-8.

As I have added it as an option to "new", it does not change anything
for those who do not enable the option. And it makes things MUCH easier for
a lot of applications.
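The shape of such an opt-in boundary conversion can be sketched as follows. This is a hypothetical illustration in Python, not Dan's actual patch to Net::LDAP; the class and method names are invented:

```python
# Hypothetical sketch: a local charset chosen at construction time.
# When the option is unset, behaviour is unchanged for existing callers.
class Connection:
    def __init__(self, local_charset=None):
        self.local_charset = local_charset  # None = no conversion (old behaviour)

    def _outgoing(self, data: bytes) -> bytes:
        """Convert local-charset bytes to UTF-8 for the wire."""
        if self.local_charset is None:
            return data
        return data.decode(self.local_charset).encode("utf-8")

    def _incoming(self, data: bytes) -> bytes:
        """Convert UTF-8 wire bytes back to the local charset."""
        if self.local_charset is None:
            return data
        return data.decode("utf-8").encode(self.local_charset)

legacy = Connection()                          # behaves exactly as before
conv = Connection(local_charset="iso-8859-1")  # opt-in conversion

print(conv._outgoing(b"Stra\xdfe"))     # b'Stra\xc3\x9fe'
print(conv._incoming(b"Stra\xc3\x9fe")) # b'Stra\xdfe'
print(legacy._outgoing(b"Stra\xdfe"))   # unchanged: b'Stra\xdfe'
```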

   Dan
