On 5/13/10 8:08 AM, Alex Karasulu wrote:
Hi Emmanuel,

On Sun, May 9, 2010 at 6:42 PM, Emmanuel Lecharny <[email protected]> wrote:
...

As soon as you consider that a delete operation will need to update all the
indices the deleted entry uses, you see that you can benefit from having such
a reverse index: you don't have to grab the entry from the backend, as you
already have its ID, which is all you need. Thus, a delete operation
is just about doing (see the sketch below):
- get the entry ID
- for each index, if we have an <Entry ID> stored, then grab the associated
values, and for each value, delete them from the forward index.
- delete the entry from the MT
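
To make the flow concrete, here is a minimal sketch using plain Java maps as stand-ins for the real XDBM tables; the names and types are illustrative, not the actual ApacheDS API, and the forward and reverse views are assumed to be kept consistent:

    import java.util.*;

    // Plain Java maps standing in for the XDBM tables; illustrative only.
    class DeleteSketch
    {
        // forward index per attribute: value --> set of entry IDs
        Map<String, Map<String, Set<Long>>> forward = new HashMap<>();
        // reverse index per attribute: entry ID --> set of values
        Map<String, Map<Long, Set<String>>> reverse = new HashMap<>();
        // master table (MT): entry ID --> serialized entry
        Map<Long, byte[]> masterTable = new HashMap<>();

        void delete( long id )
        {
            for ( String attribute : reverse.keySet() )
            {
                // the reverse index gives us the values directly from the ID, so we
                // never have to deserialize the entry from the master table
                Set<String> values = reverse.get( attribute ).remove( id );

                if ( values == null )
                {
                    continue;
                }

                for ( String value : values )
                {
                    Set<Long> ids = forward.get( attribute ).get( value );

                    if ( ids != null )
                    {
                        ids.remove( id );   // drop the <value, id> pair from the forward index
                    }
                }
            }

            // finally remove the entry itself from the MT
            masterTable.remove( id );
        }
    }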


Yep this is one benefit of the reverse index.
At this point, I'm questioning whether it's really an advantage. We can do almost the same thing using the forward index, assuming we know the attribute we are dealing with; we also have the value in the filter, so we can check that the entry we are looking at is present in the index for this attribute. Let me explain with an example:

filter: (cn=value), cn is indexed.
You are looking for entry E(x).
The reverse index will contain:
E(x) --> value
The forward index will contain:
value --> E(1), E(5), ... E(x) ... E(N)

So you can still, in O(log(N)), find the entry in the forward index without fetching the entry first.
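
Here is a small sketch of that check, with TreeMap/TreeSet standing in for the on-disk BTrees (illustrative only, not the XDBM Index API):

    import java.util.*;

    // With the value taken from the filter (cn=value), membership of a candidate
    // entry can be tested against the forward index alone.
    class ForwardCheckSketch
    {
        // forward index for cn: value --> sorted set of entry IDs
        NavigableMap<String, NavigableSet<Long>> cnForward = new TreeMap<>();

        // a reverse-style check done with the forward index only:
        // O(log(N)) to locate the value, then O(log(K)) to test the candidate ID
        boolean hasValue( long candidateId, String valueFromFilter )
        {
            NavigableSet<Long> ids = cnForward.get( valueFromFilter );

            return ( ids != null ) && ids.contains( candidateId );
        }
    }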

There are other uses for the reverse index. To start, let's just list the various
index methods leveraging the reverse index:


     boolean reverse( ID id ) throws Exception;
     boolean reverse( ID id, K attrVal ) throws Exception;
     boolean reverseGreaterOrEq( ID id ) throws Exception;
     boolean reverseGreaterOrEq( ID id, K attrVal ) throws Exception;
     boolean reverseLessOrEq( ID id ) throws Exception;
     boolean reverseLessOrEq( ID id, K attrVal ) throws Exception;
In fact, those methods don't exist anymore (don't ask me why, I just grepped all the code). I don't know where you got them from...

The only place where the reverse index is used atm is when checking that an entry has a value.

These various reverse lookups are used all over the XDBM search code base
with various expression Evaluators and Cursors.
Well, not anymore :)
  They are used to evaluate
assertion expressions in filters during search. Without reverse indices
entries must be accessed and deserialized from the master table for every
potential candidate before full filter evaluation takes place.
Nope. You can still use the forward index. But it might even be more valuable to grab the entry from the disk and check it directly against the filter in memory, instead of reading the indices. My perception is that we can save a lot of time, assuming that the set of returned entries is small and that you have more than a couple of filter expressions, as reading an index might lead to disk access.
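
A minimal sketch of that alternative, assuming the entries are already deserialized into plain maps and the remaining filter assertions are simple predicates (simplified stand-ins, not the real ApacheDS types):

    import java.util.*;
    import java.util.function.Predicate;

    // Once a candidate ID comes out of the index used to seed the search, fetch the
    // entry once and evaluate the rest of the filter in memory, instead of doing one
    // index lookup per expression.
    class InMemoryEvaluationSketch
    {
        // master table: entry ID --> attribute/value map (already deserialized here)
        Map<Long, Map<String, Set<String>>> masterTable = new HashMap<>();

        List<Map<String, Set<String>>> search( Set<Long> candidateIds,
            List<Predicate<Map<String, Set<String>>>> remainingAssertions )
        {
            List<Map<String, Set<String>>> results = new ArrayList<>();

            for ( Long id : candidateIds )
            {
                // one MT fetch per candidate
                Map<String, Set<String>> entry = masterTable.get( id );

                // evaluate every remaining assertion against the in-memory entry:
                // no further index (and thus potentially no further disk) access
                if ( remainingAssertions.stream().allMatch( assertion -> assertion.test( entry ) ) )
                {
                    results.add( entry );
                }
            }

            return results;
        }
    }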

Now, it all depends on the memory size. If all the indices and entries are in memory (the optimal situation), then fetching an entry costs nothing, and we save all the index browsing by checking the entry against the filter directly.

Of course, if we have disk access, then reading entries from the disk is costly, but so is the index browsing, as you will have to hit the disk too. Consider that you have N candidate entries, 1/3 of them being valid candidates (i.e. they will be returned to the client); the other filter expressions will be used to eliminate the 2/3 of bad entries. If you have, say, M filter expressions, then you will potentially hit the disk while reading the (M-1) remaining indices, and then, when the entry fits, read the entry from the disk too. If the entry does not fit, you just save the MT access. So for valid entries you have an extra cost, since you read them from disk anyway, plus all the potential index disk accesses. For invalid entries, you still have the index reads.
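
As a back-of-the-envelope illustration of those counts, with purely illustrative numbers for N and M, and assuming every index or MT access may hit the disk:

    // Worst-case disk read counts for the two strategies, under the assumptions
    // above: N candidates, 1/3 of them valid, M filter expressions.
    public class CostSketch
    {
        public static void main( String[] args )
        {
            int n = 9000;    // candidate entries coming out of the first index
            int m = 4;       // expressions in the filter
            int valid = n / 3;

            // strategy 1: evaluate with the remaining (M - 1) indices, then fetch
            // only the matching entries from the master table
            long indexBased = ( long ) n * ( m - 1 ) + valid;

            // strategy 2: fetch every candidate from the master table and evaluate
            // the whole filter in memory, with no further index reads
            long entryBased = n;

            System.out.println( "index-based reads: " + indexBased );   // 27000 + 3000 = 30000
            System.out.println( "entry-based reads: " + entryBased );   // 9000
        }
    }

With these numbers the index-based strategy does roughly N*(M-1) + N/3 reads versus N reads for the fetch-the-entry-first strategy; of course the real ratio depends entirely on cache hit rates and entry sizes.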

Of course, none of this is proven; I'm talking from a theoretical POV. I'm just trying to rationalise the algorithm, in order to have more than one perspective on the search performance.

By no means do I think that the proposed solution is faster; I just think that it might be a good idea to test it.

Now, an even better option: all this should be optional. If we can fit all the data in memory, then I don't think that we should use reverse indexes. If the database is big, and if it's proven that having reverse indexes has benefits, then we should create them.
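
If it helps, a purely hypothetical sketch of what such an opt-in knob could look like at index configuration time (none of these names exist in ApacheDS today):

    // Hypothetical per-index configuration: the reverse table is only built when asked for.
    public class IndexConfigSketch
    {
        private final String attributeId;
        private final boolean withReverse;

        public IndexConfigSketch( String attributeId, boolean withReverse )
        {
            this.attributeId = attributeId;
            this.withReverse = withReverse;
        }

        public String getAttributeId()
        {
            return attributeId;
        }

        public boolean isWithReverse()
        {
            return withReverse;
        }

        public static void main( String[] args )
        {
            // small deployment that fits in memory: forward index only
            IndexConfigSketch cn = new IndexConfigSketch( "cn", false );

            // big database where reverse lookups have proven their benefit
            IndexConfigSketch objectClass = new IndexConfigSketch( "objectClass", true );

            System.out.println( cn.getAttributeId() + " reverse: " + cn.isWithReverse() );
            System.out.println( objectClass.getAttributeId() + " reverse: " + objectClass.isWithReverse() );
        }
    }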

In all cases, we should not close any door.
This could be
massively expensive and would cause cache turnover on most searches. The
performance would degrade considerably, and the whole search engine would have to
be altered, since things like alias dereferencing rely heavily on reverse indices
and would need to be handled differently.

Certainly we can have improvements in the algorithm; it is by no means
perfect. However, please realize that this search engine design and db
structure has suited our needs for over 8 years with minor tweaks here and
there. I'm fairly confident that the removal of reverse tables in indices
will be disastrous, but in no way would I say we should not give it a try.
But please experiment in another branch. This is by far one of the most
critical aspects of the LDAP server.
Well, I didn't intend to change *anything* atm. It's far too premature, and I just wanted to push those ideas here for future evaluations.

That kind of experimentation obviously deserves to be done in a branch.

What I expect at the end of the day is that we find the best possible algorithm, assuming that we don't have a one-size-fits-all solution. If only we were able to tune the search algorithm depending on the memory size, the data size, and the requests we receive, it would be great!

In any case, it's an interesting discussion!


--
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com

