Substring filter : how to improve their use ?

Emmanuel Lécharny Thu, 21 Feb 2013 10:15:13 -0800

Hi guys,

currently, when we do a search, we try to evaluate the number of
candidates each node in the filter expression will return. For instance,
such a filter :


(&(cn=test)(sn=acme))

will evaluate the number of entries with a CN = test and the number of
entries with a SN = acme. Let say the former has 10 entries and the last
has 5 entries, we will annotate each node with this number :


(&:[5](cn=test:[10])(sn=acme:[5]))

Here, the & will have the lowest number, and at the end, we will select
the node whith the lowest number as a base to fetch the entries.

So far, so good (I don't ant to go any further about how we evaluate
other operators, like | and !, it's way more complex but irrelevant here).

Every filter will also have an additional node added : the scope. The
scope allows us to limit the number of entries to fetch, depending on
where in the tree we are. That can save some time.

In our example, let's say we have 1000 entries under ou=system, 100
under ou=users,ou=system and 3 under ou=unitTest,ou=users,ou=system. The
filter will be transformed internally to this filter (assuming the scope
is SUBTREE) :

(&(scope=SUB)(cn=test)(sn=acme))

and annoted as :


base search = "ou=system" :
(&:[5](scope=SUB[1000](cn=test:[10])(sn=acme:[5]))

base search = "ou=users,ou=system" :
(&:[5](scope=SUB[100](cn=test:[10])(sn=acme:[5]))

base search = "ou=unitTest,ou=users,ou=system" :
(&:[3](scope=SUB[3](cn=test:[10])(sn=acme:[5]))

As we can see, the scope can change the evaluation. In the last case, we
will pick the scope node to fetch the entries, as we will only get 3 of
them.

Ok, let's go a step further. Those filters were pretty simple to handle,
as the underlying database know how many entries will be fetched when
handling a filter like (cn=test).

Substring, <=, >= filters are way different. We can't know many entries
will be >= a key, or which match the substring. For instance, how do we
know how many entries will match cn=te* ? We can find the first key
matching cn=te in the index, but we will have to walk the index up (or
down) to find the entries matching cn=tf*. That means we have acess to
the ordering method, to build the key which comes after 'te' (ie, 'tf'),
something which is quite difficult to do atm...

So...

I modified the evaluation for the substring/lesserThen/greaterThan
indexed filters, to use a count which is different from the number of
element in the index. Right now, the count will always be 10. Why 10 ?
Why a fixed count ?

- A fixed count is assuming that the user *knows* what he/she is doing
when doing a search. Of course, some might do things like (cn=a*), which
might return thousands of candidate, but let's say it's not likely.
- 10 because it gives a chance that a better index can be selected.

I don't really like the '10' value, but it works well. Keep in mind that
it's just a lower bound. That does not mean the filter will just return
10 candidate max ! It's just a way for the optimizer to select the
substring index instead of defaulting to something else with tens of
thousands of candidates.
May be 100 would be better...

Keep in mind that if there are only 2 candidates that match the filter,
we will have the filter selected with 10 as a max count, but we will not
fetch 10 candiates, just 2 !

I have modified the server to work this way, if you have better ideas,
I'm listening !

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Substring filter : how to improve their use ?

Reply via email to