Re: LDAP protocol implementation and data containing accents

Emmanuel Lecharny Wed, 31 Aug 2005 02:34:08 -0700

On Wed, 2005-08-31 at 11:00 +0200, Jérôme Baumgarten wrote:
> On 8/30/05, Emmanuel Lecharny <[EMAIL PROTECTED]> wrote:
> > > Also, accents in a filter are incorrectly received (decoded ?) in
> > > SearchHandler, for example the filter (sn=*é*) is retrieved as
> > > (sn=*Ã©*).
> > 
> > Are you using UTF-8 to encode your string? Data are stored in UTF-8
> > format in Ldap.
> 
> I did some other tests and I get the following (clients and server
> running on a Windows box) :
> 
>  * JXplorer (build JXv3.2b2 2005-08-18 13:46 EST) : filter is
> incorrect w.r.t accents
> 
>  * Softerra LDAP Browser 2.6 : filter is incorrect w.r.t accents
> 
>  * JNDI test code : filter is incorrect w.r.t accents
> 
>  * JLDAP test code : filter is incorrect w.r.t accents
> 
>  * OpenLDAP ldapsearch (but running on a Linux box) : filter is
> correct w.r.t accents
> 
> I can fix these problems if I do the following :
> 
> String filter = LdapProxyUtils.filterToString(request.getFilter());
> try {
>   filter = new String(filter.getBytes(), "UTF-8");
> } catch (UnsupportedEncodingException ueEx) {
>   throw new RuntimeException(ueEx);
> }
> 
> But I don't really understand why I must do so since "RFC 2254 - The
> String Representation of LDAP Search Filters" says that it is
> represented as an UTF-8 string. Thus I would expect the filter value
> to be correct, no matter the platform my LDAP proxy is running on.


   

It's not a question of tool or platform. Values are stored in UTF-8 in
LDAP if they are Strings (from RFC 2251) :

"
4.1.2. String Types
The LDAPString is a notational convenience to indicate that, although
   strings of LDAPString type encode as OCTET STRING types, the ISO
   10646 [13] character set (a superset of Unicode) is used, encoded
   following the UTF-8 algorithm [14]. Note that in the UTF-8 algorithm
   characters which are the same as ASCII (0x0000 through 0x007F) are
   represented as that same ASCII character in a single byte.  The other
   byte values are used to form a variable-length encoding of an
   arbitrary character."

So you must send String values encoded in UTF-8 when querying an LDAP server.
If you use a tool, there is a good chance that a conversion is done from your
locale to UTF-8 (i.e. ISO-8859-1 to UTF-8 in your case).

If you write a piece of code to send requests to LDAP, you *MUST* do this
conversion yourself. Holding the value in a plain String such as "Jérôme" is
not enough: Java stores it internally in UTF-16, and calling getBytes() without
an argument encodes it with the platform default charset, not UTF-8.

So you should always use "Jérôme".getBytes("UTF-8") to obtain the bytes before
sending data to LDAP. This applies to search filters, too.
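
To make the conversion concrete, here is a minimal sketch in Java. It assumes
Java 7 or later for StandardCharsets (on older JVMs you would pass the charset
name "UTF-8" instead), and the class and method names are only illustrative,
not taken from any LDAP library. The accented character is written as \u00e9
so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: round-trips a search filter through UTF-8 bytes.
class Utf8FilterDemo {
    // Encode a Java String (UTF-16 internally) into the UTF-8 bytes LDAP expects.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes read from the wire back into a Java String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String filter = "(sn=*\u00e9*)"; // \u00e9 is the accented e from the filter
        byte[] wire = toUtf8(filter);

        // Print the bytes in hex; the accented e becomes the two bytes C3 A9.
        StringBuilder hex = new StringBuilder();
        for (byte b : wire) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());

        // Decoding with the same charset gives back the original filter.
        System.out.println(fromUtf8(wire).equals(filter)); // true
    }
}
```

The point is simply that the charset is named explicitly on both sides, so the
result does not depend on the locale of the machine.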


> Also, has anyone tested search on ApacheDS with filter containing
> accents ? The problems I'm facing right now may also be present with
> ApacheDS.

Sure, we have problems with accents! Strings are created in ApacheDS
using new String(byte[] data) without specifying the UTF-8 encoding, so
the platform default charset is used. This is a bug. It would be great to
add a JIRA issue with a simple test case.
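
For illustration, the effect of decoding with the wrong charset can be
reproduced with a small sketch. This is not the ApacheDS code; the class and
helper names are hypothetical, and ISO-8859-1 is used here as a stand-in for
the default charset of a Western European Windows box. It assumes Java 7 or
later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows how UTF-8 bytes turn into garbage when decoded
// with a single-byte charset, and how naming UTF-8 explicitly avoids it.
class DecodeBugDemo {
    // Hypothetical helper: decode wire bytes with an explicitly chosen charset.
    static String decode(byte[] wire, Charset charset) {
        return new String(wire, charset);
    }

    public static void main(String[] args) {
        // The filter (sn=*e-acute*) as it travels on the wire, in UTF-8.
        byte[] wire = "(sn=*\u00e9*)".getBytes(StandardCharsets.UTF_8);

        // Wrong: each of the two UTF-8 bytes C3 A9 is read as its own character.
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals("(sn=*\u00c3\u00a9*)")); // true

        // Right: the two bytes are read as one character again.
        String right = decode(wire, StandardCharsets.UTF_8);
        System.out.println(right.equals("(sn=*\u00e9*)")); // true
    }
}
```

A test case for the JIRA issue could be built along the same lines: send a
filter with an accented character and compare what the server decodes.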

However, we are currently tracking down a bug related to encoding and
binary values; fixing it may solve your problem.

Emmanuel Lécharny
