On 5/13/10 8:08 AM, Alex Karasulu wrote:
Hi Emmanuel,
On Sun, May 9, 2010 at 6:42 PM, Emmanuel Lecharny <[email protected]> wrote:
...
As soon as you consider that a delete operation will need to update all the
indexes the deleted entry uses, you see that you can benefit from having such
a reverse index: you don't have to grab the entry from the backend, as you
already have its ID, which is all you need. Thus, a delete operation
is just a matter of:
- get the entry ID
- for each index, if we have an <Entry ID> stored, then grab the associated
values, and for each value, delete them from the forward index.
- delete the entry from the MT
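The delete sequence above could be sketched like this. This is a minimal in-memory model, not the actual ApacheDS API: the map-based index tables, the `add`/`delete` helpers, and the entry IDs are all invented for illustration.

```java
import java.util.*;

public class ReverseIndexDelete {
    // forward index: attribute value -> set of entry IDs
    static Map<String, Set<Long>> forward = new HashMap<>();
    // reverse index: entry ID -> set of attribute values
    static Map<Long, Set<String>> reverse = new HashMap<>();
    // master table (MT): entry ID -> serialized entry
    static Map<Long, String> masterTable = new HashMap<>();

    static void add(long id, String value, String entry) {
        forward.computeIfAbsent(value, v -> new HashSet<>()).add(id);
        reverse.computeIfAbsent(id, i -> new HashSet<>()).add(value);
        masterTable.put(id, entry);
    }

    // Delete without ever deserializing the entry: the reverse index
    // already tells us which forward-index keys need cleaning up.
    static void delete(long id) {
        Set<String> values = reverse.remove(id);
        if (values != null) {
            for (String value : values) {
                Set<Long> ids = forward.get(value);
                ids.remove(id);
                if (ids.isEmpty()) forward.remove(value);
            }
        }
        masterTable.remove(id); // finally drop the entry from the MT
    }

    public static void main(String[] args) {
        add(1L, "john", "cn=john entry");
        add(2L, "john", "cn=john2 entry");
        delete(1L);
        System.out.println(forward.get("john"));      // entry 2 survives
        System.out.println(masterTable.containsKey(1L));
    }
}
```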
Yep this is one benefit of the reverse index.
At this point, I'm questioning whether that is really an advantage. We can
do almost the same thing using the forward index, assuming we know the
attribute we are dealing with; since we also have the value in the filter,
we can check that the entry we are looking at is present in the index
for this attribute. Let me explain with an example:
filter: (cn=value), cn is indexed.
You are looking for entry E(x).
The reverse index will contain:
E(x) --> value
The forward index will contain:
value --> E(1), E(5), ... E(x) ... E(N)
so you can still, in O(log(N)), find the entry in the forward index
without fetching the entry first.
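That forward-only check could look like the sketch below, assuming a B-tree-like sorted forward index (the class, the `cnForward` map, and `hasValue` are hypothetical names, not ApacheDS code):

```java
import java.util.*;

public class ForwardOnlyCheck {
    // forward index for cn, kept sorted as a B-tree would keep it
    static NavigableMap<String, NavigableSet<Long>> cnForward = new TreeMap<>();

    // Does entry `id` carry `value` for cn? Both probes are O(log N)
    // against the forward index alone: no reverse table, no entry fetch.
    static boolean hasValue(String value, long id) {
        NavigableSet<Long> ids = cnForward.get(value); // locate the value's ID set
        return ids != null && ids.contains(id);        // then probe for the entry ID
    }

    public static void main(String[] args) {
        cnForward.computeIfAbsent("value", v -> new TreeSet<>())
                 .addAll(Arrays.asList(1L, 5L, 42L));
        System.out.println(hasValue("value", 42L)); // E(42) is a match
        System.out.println(hasValue("value", 7L));  // E(7) is not
    }
}
```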
There are other uses for the
reverse index. To start let's just list the various index methods leveraging
the reverse index:
boolean reverse( ID id ) throws Exception;
boolean reverse( ID id, K attrVal ) throws Exception;
boolean reverseGreaterOrEq( ID id ) throws Exception;
boolean reverseGreaterOrEq( ID id, K attrVal ) throws Exception;
boolean reverseLessOrEq( ID id ) throws Exception;
boolean reverseLessOrEq( ID id, K attrVal ) throws Exception;
In fact, those methods don't exist anymore (don't ask me why, I
just grepped all the code). I don't know where you got them from ...
The only place where the reverse index is used atm is when checking that
an entry has a value.
These various reverse lookups are used all over the XDBM search code base
with various expression Evaluators and Cursors.
Well, not anymore :)
They are used to evaluate
assertion expressions in filters during search. Without reverse indices
entries must be accessed and deserialized from the master table for every
potential candidate before full filter evaluation takes place.
Nope. You can still use the forward index. But it might even be
valuable to grab the entry from disk and check it directly against the
filter in memory, instead of reading indexes. My perception is that we can
save a lot of time, assuming the set of returned entries is small and
you have more than a couple of filter expressions, as reading indexes
might lead to disk accesses.
Now, it all depends on the memory size. If all the indexes and entries are
in memory (the optimal situation), then fetching an entry costs nothing, and
we save all the index browsing by checking the entry against the filter
directly.
Of course, if we have disk accesses, then reading entries from disk is
costly, but so is the index browsing, as you will have to hit the disk
there too. Consider that you have N candidate entries, 1/3 of them being
valid candidates (i.e. they will be returned to the client); the other
filter expressions will be used to eliminate the 2/3 of bad entries.
If you have, say, M filter expressions, then you will potentially hit
the disk while reading the (M-1) other indexes, and then, when the entry
fits, read the entry from disk too. If the entry does not fit, you just
save the MT access. So for valid entries you have an extra cost, as you
read them from disk anyway, plus all the potential index disk accesses.
For invalid entries, you still have the index reads.
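One way to make that back-of-the-envelope comparison concrete is the model below. The cost constants and method names are invented for illustration; every index probe and MT read is counted as one disk access, which is of course a simplification.

```java
public class SearchCostModel {
    // Rough disk-access counts for evaluating N candidates against M
    // filter expressions, where one index was already used to get the
    // candidate set.

    // Strategy A (index-first): probe the remaining (M-1) indexes for
    // every candidate, then read only the matching entries from the MT.
    static long indexFirst(long n, long m, double validRatio) {
        long indexReads = n * (m - 1);                // index probes may hit disk
        long entryReads = Math.round(n * validRatio); // MT reads for valid entries
        return indexReads + entryReads;
    }

    // Strategy B (entry-first): fetch every candidate entry and evaluate
    // the whole filter in memory; no further index reads at all.
    static long entryFirst(long n) {
        return n; // one MT read per candidate
    }

    public static void main(String[] args) {
        long n = 900;            // candidates produced by the first index
        long m = 4;              // filter expressions
        double valid = 1.0 / 3;  // 1/3 of candidates are returned
        System.out.println("index-first: " + indexFirst(n, m, valid)); // 2700 + 300
        System.out.println("entry-first: " + entryFirst(n));           // 900
    }
}
```

With these (made-up) numbers the entry-first strategy does fewer disk accesses as soon as M grows past 2, which is exactly the intuition argued above; whether that holds on real data is what a branch experiment would have to show.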
Of course, this is all but proven; I'm talking from a theoretical POV. I'm
just trying to rationalise the algorithm, in order to have more than one
perspective on the search performance.
By no means do I think that the proposed solution is faster; I just think
that it might be a good idea to test it.
Now, an even better option: all this should be optional. If we can
fit all the data in memory, then I don't think that we should use
reverse indexes. If the database is big, and if it's proven that having
reverse indexes is beneficial, then we should create them.
In all cases, we should not close any door.
This could be
massively expensive and would cause cache turnover on most searches.
Performance would degrade considerably, and the whole search engine would
have to be altered to handle things like alias dereferencing differently,
since that relies heavily on reverse indices.
Certainly we can have improvements in the algorithm; it is by no means
perfect. However please realize that this search engine design and db
structure has suited our needs for over 8 years with minor tweaks here and
there. I'm fairly confident that the removal of reverse tables in indices
will be disastrous, but in no way would I say we should not give it a try.
But please experiment in another branch. This is by far one of the most
critical aspects of the LDAP server.
Well, I didn't intend to change *anything* atm. It's far too premature,
and I just wanted to push those ideas here for future evaluations.
That kind of experimentation obviously deserves to be done in a branch.
What I expect at the end of the day is that we find the best possible
algorithm, assuming that we don't have a fit-all solution. If only we
were able to tune the search algorithm depending on the memory size,
the data size, and the requests we receive, then it would be great!
In any case, it's an interesting discussion!
--
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com