Hi,
Currently, we create a forward and a reverse index for each index
attribute. The forward index is used when we annotate a filter,
searching for the number of candidate it filters. The reverse index
usage is a bit more subtile.
Let's consider a filter like (cn=jo*). As we don't have any clue about
the value (it will start with 'jo', but that's all we know), we can't
use the 'cn' forward index. We can't use the 'cn' reverse index either,
at least directly. So the engine will use the scope to determinate a
list of candidate : {E1, E2, ... En}. Now, instead of fetching all those
entries from the master table, what we can do is to use the 'cn' reverse
table, as we have a list of entry IDs.
For each entry id in {E1, E2, ... En}, we will check if the 'cn' reverse
table contain some value starting with 'jo'.
If th 'cn' index contains the two following table :
forward =
john --> {E1, E3}
joe --> {E2, E3, E5}
...
reverse =
E1 --> john
E2 --> joe
E3 --> joe, john
E5 --> joe
...
then using the reverse table, we will find that the E1, E2, E3 and E5
entry match, when all the other aren't. No need to fetch the E4, ... En
entries from the master table.
Now, exploiting this rverse table means we read a btree. Which has the
same cost than reading the Master Table (except that we won't have to
deserialize the entry).
What if we don't have a reverse table ?
We will have to deserialize the entries, all of them from the {E1, E2,
... En} set filtered by the scope.
Is this better then to build a reverse index ? Not necessarily. In the
case where we have more than one node in the filter, for instance
(&(cn=jo*)(sn=test)(objectClass=person)), then using the reverse index
means we will access as many btrees as we have index attributes in the
filter node. Here, if cn, sn and objectClass are indexed, we will
potentially access 4 btrees (the scope will force us to consider using
the oneLevel index or the subLevel index).
At the end, it might be more costly, compared to using the entry and
match it against the nodes in the filter.
When we have many entries already in the cache, thus sparing the cost of
a deserialization, then accessing more than one BTree might be costly,
comparing to using the entries themselves.
An alternative
--------------
The pb is that the Node in the filter use a substring : jo*. Our index
are built using the full normalized value. That does not fit.
What if instead of browsing the underlying tree, we use a specific
compare() method instead of an equals() method ? As the tree is ordered,
finding the subset of entries taht will be valid candidate is just a
matter of finding the part of the tree which match the comparison.
In our example, the compare() method will compare the first caracters
from the keys to the filter value. Here, the compare method will be the
String.startWith(), and if it's not true, we will do a compareTo() to
know if we should go donw th tree on the left or on the right of the
current key.
For that, we just need the forward index, the reverse index is useless.
If we use a substring with a '*' at the beginning, then we need another
index : a reverted index. Tht means all the values will be stored
reverted : john will be stored as 'nhoj', so doing a search on (cn=*hn)
will be fast.
Thoughts ?
--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com