On Wed, 2005-08-31 at 11:00 +0200, Jérôme Baumgarten wrote:
> On 8/30/05, Emmanuel Lecharny <[EMAIL PROTECTED]> wrote:
> > > Also, accents in a filter are incorrectly received (decoded ?) in
> > > SearchHandler, for example the filter (sn=*é*) is retrieved as
> > > (sn=*Ã©*).
> >
> > Are you using UTF-8 to encode your string? Data are stored in UTF-8
> > format in Ldap.
>
> I did some other tests and I get the following (clients and server
> running on a Windows box) :
>
> * JXplorer (build JXv3.2b2 2005-08-18 13:46 EST) : filter is
> incorrect w.r.t accents
>
> * Softerra LDAP Browser 2.6 : filter is incorrect w.r.t accents
>
> * JNDI test code : filter is incorrect w.r.t accents
>
> * JLDAP test code : filter is incorrect w.r.t accents
>
> * OpenLDAP ldapsearch (but running on a Linux box) : filter is
> correct w.r.t accents
>
> I can fix these problems if I do the following :
>
> String filter = LdapProxyUtils.filterToString(request.getFilter());
> try {
> filter = new String(filter.getBytes(), "UTF-8");
> } catch (UnsupportedEncodingException ueEx) {
> throw new RuntimeException(ueEx);
> }
>
> But I don't really understand why I must do so since "RFC 2254 - The
> String Representation of LDAP Search Filters" says that it is
> represented as an UTF-8 string. Thus I would expect the filter value
> to be correct, no matter the platform my LDAP proxy is running on.
It's not a question of tool or platform. Values are stored in UTF-8 in
LDAP if they are Strings (from RFC 2251) :
"
4.1.2. String Types
The LDAPString is a notational convenience to indicate that, although
strings of LDAPString type encode as OCTET STRING types, the ISO
10646 [13] character set (a superset of Unicode) is used, encoded
following the UTF-8 algorithm [14]. Note that in the UTF-8 algorithm
characters which are the same as ASCII (0x0000 through 0x007F) are
represented as that same ASCII character in a single byte. The other
byte values are used to form a variable-length encoding of an
arbitrary character."
So you must send String values encoded in UTF-8 when querying an LDAP
server. If you use a tool, there is a good chance that a conversion is
done from your locale to UTF-8 (i.e. ISO-8859-1 to UTF-8 in your case).
If you write a piece of code to send requests to LDAP, you *MUST* do
this conversion yourself. Using a simple new String("Jérôme") is not
enough, as Java holds "Jérôme" internally in UTF-16. So you should
always convert explicitly, with "Jérôme".getBytes("UTF-8"), before
sending data to LDAP. This applies to search filters, too.
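
To make the conversion concrete, here is a minimal sketch of encoding a
filter to UTF-8 and of what happens when the receiving side decodes the
same bytes as ISO-8859-1. The class and method names are made up for the
example, and it assumes Java 7 or later for StandardCharsets (on older
JVMs, pass the charset name "UTF-8" instead).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any LDAP library.
class Utf8FilterDemo {

    // Encode a filter to the UTF-8 bytes that should go on the wire.
    static byte[] toWire(String filter) {
        return filter.getBytes(StandardCharsets.UTF_8);
    }

    // Decode wire bytes correctly, as UTF-8.
    static String fromWire(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode wire bytes with the wrong charset (ISO-8859-1), which
    // turns each byte of a multi-byte sequence into its own character.
    static String fromWireWrong(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute; escapes avoid source-file encoding issues.
        String filter = "(sn=*\u00e9*)";
        byte[] wire = toWire(filter);

        // 8 characters become 9 bytes: e-acute is encoded as 0xC3 0xA9.
        System.out.println("chars=" + filter.length() + " bytes=" + wire.length);

        // Round trip with the right charset gives the filter back.
        System.out.println(fromWire(wire).equals(filter));

        // Wrong charset yields U+00C3 U+00A9 in place of e-acute.
        System.out.println(fromWireWrong(wire).equals("(sn=*\u00c3\u00a9*)"));
    }
}
```

The wrongly decoded string is the garbled filter reported at the top of
this thread, which suggests the UTF-8 bytes were read with a single-byte
charset somewhere on the receiving side.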
> Also, has anyone tested search on ApacheDS with filter containing
> accents ? The problems I'm facing right now may also be present with
> ApacheDS.
Sure, we have problems with accents! Strings are created in ApacheDS
using new String(byte[] data) without specifying UTF-8 as the encoding,
so the platform default charset is used. This is a bug. It would be
good to add a JIRA issue with a simple test case. However, we are
currently tracking down a bug related to encoding and binary values;
fixing it may fix your problem.
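
As a starting point for such a test case, here is a rough sketch that
contrasts new String(byte[]) with a decode that names UTF-8 explicitly.
It is not ApacheDS code: the class and method names are hypothetical, and
it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating the kind of bug described above.
class DefaultCharsetBugDemo {

    // Mirrors the buggy pattern: the result depends on the JVM's
    // platform default charset.
    static String decodeWithPlatformDefault(byte[] data) {
        return new String(data);
    }

    // Names the charset explicitly, so the result is the same on
    // every platform.
    static String decodeAsUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Jérôme" as UTF-8 bytes: é is 0xC3 0xA9, ô is 0xC3 0xB4.
        byte[] wire = {
            'J', (byte) 0xC3, (byte) 0xA9, 'r',
            (byte) 0xC3, (byte) 0xB4, 'm', 'e'
        };

        System.out.println("default charset: " + Charset.defaultCharset());

        // Six characters regardless of platform.
        System.out.println("explicit UTF-8 length: "
                + decodeAsUtf8(wire).length());

        // Six characters only if the default charset happens to be UTF-8;
        // a single-byte default such as windows-1252 gives eight.
        System.out.println("platform default length: "
                + decodeWithPlatformDefault(wire).length());
    }
}
```

On a box whose default charset is already UTF-8 both calls agree, which
may be why the problem does not show up everywhere.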
Emmanuel Lécharny