About reverse index ...

Emmanuel Lecharny Wed, 12 May 2010 08:42:36 -0700

Hi guys,

in an attempt to jump into the backend database, and due to the problem we have on the add operation, I looked at the current implementation (based on JDBM) and the way we use it.

Right now, we have one master table containing all the entries, plus many indexes, some of them being system indexes (CSN, UUID, RDN, etc) and others being user defined.


Each of these indexes is composed of two tables :

- a forward index : from a key, you get a link to entries in the MasterTable (MT)

- a reverse index : from an entry ID, you get a link to all the contained values

As soon as you consider that a delete operation will need to update all the indexes the deleted entry uses, you see that you can benefit from having such a reverse index : you don't have to grab the entry from the backend, as you already have its ID, which is all you need. Thus, a delete operation is just about doing :
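To make the two tables concrete, here is a minimal in-memory sketch. It is not the JDBM-backed code: the class name `TwoTableIndex` and its methods are hypothetical, and plain sorted maps stand in for the on-disk tables.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for one index; the real tables are JDBM-backed.
final class TwoTableIndex {
    // forward table: attribute value -> IDs of the entries holding that value
    private final Map<String, Set<Long>> forward = new TreeMap<>();
    // reverse table: entry ID -> values this entry contributed to the index
    private final Map<Long, Set<String>> reverse = new TreeMap<>();

    // Every write has to touch both tables.
    void add(String value, long entryId) {
        forward.computeIfAbsent(value, k -> new TreeSet<>()).add(entryId);
        reverse.computeIfAbsent(entryId, k -> new TreeSet<>()).add(value);
    }

    // From a key, get the IDs of the matching entries in the master table.
    Set<Long> forwardLookup(String value) {
        return forward.getOrDefault(value, Collections.<Long>emptySet());
    }

    // From an entry ID, get all the indexed values of that entry.
    Set<String> reverseLookup(long entryId) {
        return reverse.getOrDefault(entryId, Collections.<String>emptySet());
    }
}
```

Note that `add` writes twice for every indexed value, which is the maintenance cost discussed further down.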

- get the entry ID

- for each index, if we have an <Entry ID> stored, then grab the associated values, and for each value, delete it from the forward index.

- delete the entry from the MT

Sounds like you have avoided a fetch from the MT, but you have to pay a heavy penalty for that :

- first, as you have no idea how many indexes are used for this entry (suppose that the entry does not contain the optional indexed 'cn' attribute), you still have to check in every index.

- second, you have to maintain 2 tables for each index

IMO, the cost is way too expensive compared to the basic approach : grab the entry.
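The two delete paths can be compared with a small sketch. This is an in-memory model under stated assumptions, not the actual backend code: `DeleteSketch`, `Index`, `deleteViaReverseIndex`, `deleteViaEntryFetch` and the `reverseProbes` counter are all hypothetical names, and hash maps stand in for the JDBM tables.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model; the real tables are JDBM-backed.
final class DeleteSketch {
    static final class Index {
        final Map<String, Set<Long>> forward = new HashMap<>(); // value -> entry IDs
        final Map<Long, Set<String>> reverse = new HashMap<>(); // entry ID -> values

        void add(String value, long id) {
            forward.computeIfAbsent(value, k -> new HashSet<>()).add(id);
            reverse.computeIfAbsent(id, k -> new HashSet<>()).add(value);
        }

        void removeForward(String value, long id) {
            Set<Long> ids = forward.get(value);
            if (ids != null && ids.remove(id) && ids.isEmpty()) {
                forward.remove(value); // drop the key once no entry uses it
            }
        }
    }

    // master table: entry ID -> entry (attribute name -> values)
    final Map<Long, Map<String, Set<String>>> masterTable = new HashMap<>();
    final Map<String, Index> indexes = new HashMap<>(); // keyed by attribute name
    int reverseProbes = 0; // counts reverse-table lookups, for illustration only

    void addEntry(long id, Map<String, Set<String>> entry) {
        masterTable.put(id, entry);
        for (Map.Entry<String, Set<String>> attr : entry.entrySet()) {
            Index index = indexes.get(attr.getKey());
            if (index == null) continue; // attribute is not indexed
            for (String value : attr.getValue()) index.add(value, id);
        }
    }

    // Delete without reading the entry: every index has to be probed,
    // even those the entry never appeared in.
    void deleteViaReverseIndex(long id) {
        for (Index index : indexes.values()) {
            reverseProbes++;
            Set<String> values = index.reverse.remove(id);
            if (values == null) continue; // wasted lookup
            for (String value : values) index.removeForward(value, id);
        }
        masterTable.remove(id);
    }

    // Basic approach: grab the entry, then touch only the indexes it uses.
    void deleteViaEntryFetch(long id) {
        Map<String, Set<String>> entry = masterTable.remove(id);
        if (entry == null) return;
        for (Map.Entry<String, Set<String>> attr : entry.entrySet()) {
            Index index = indexes.get(attr.getKey());
            if (index == null) continue;
            for (String value : attr.getValue()) index.removeForward(value, id);
            index.reverse.remove(id); // only needed while reverse tables still exist
        }
    }
}
```

In this model, deleting an entry without a 'cn' attribute through the reverse index still costs one probe in the cn index, whereas the fetch-based path never looks at it.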

Now, it's not enough to say 'kill the reverse index !'. There is one more reason why we want to have this reverse index, the question is : does it bring a lot of benefit ?

So why do we need this reverse index ? We discussed it with Stefan, and here is what it is used for : Suppose you have a search request with a filter like (&(ObjectClass=XXX)(cn=YYY)).

The search engine will evaluate the number of entries each of those filter nodes will get back. Suppose it's N for the OC filter, and M for the cn filter. Let's say that N > M.

Now, we will loop on the M entries to check if they fit the OC filter.

How do we do that ? The first approach would be to grab the entry, and check in memory if the filter (ObjectClass=XXX) matches the entry. If not, we ditch the entry, otherwise, it's a valid candidate.

The second option is to use the reverse index : we have the entry ID (we haven't grabbed the entry from disk yet), and we can see if the OC table contains a reference for this entry ID. If not, we can move to the next entry ID. Otherwise, we can grab the entry and return it.

Obviously there is some potential for a speedup. Now, let's consider the cases where we benefit from not grabbing the entry.
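Both options can be sketched for the filter (&(ObjectClass=XXX)(cn=YYY)). This is a simplified in-memory model, not the search engine's code: `CandidateFilterSketch`, its methods and the `masterReads` counter are hypothetical, and each entry carries a single objectClass and cn value to keep it short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model: single-valued attributes, maps instead of JDBM tables.
final class CandidateFilterSketch {
    final Map<Long, Map<String, String>> masterTable = new HashMap<>();
    final Map<String, Set<Long>> cnForward = new HashMap<>(); // cn value -> entry IDs
    final Map<Long, Set<String>> ocReverse = new HashMap<>(); // entry ID -> OC values
    int masterReads = 0; // counts entries grabbed from the MT

    void addEntry(long id, String objectClass, String cn) {
        Map<String, String> entry = new HashMap<>();
        entry.put("objectClass", objectClass);
        entry.put("cn", cn);
        masterTable.put(id, entry);
        cnForward.computeIfAbsent(cn, k -> new TreeSet<>()).add(id);
        ocReverse.computeIfAbsent(id, k -> new HashSet<>()).add(objectClass);
    }

    Map<String, String> fetch(long id) {
        masterReads++;
        return masterTable.get(id);
    }

    // Option 1: grab each of the M candidates, evaluate the OC filter in memory.
    List<Long> searchByFetching(String oc, String cn) {
        List<Long> matches = new ArrayList<>();
        for (long id : cnForward.getOrDefault(cn, Collections.<Long>emptySet())) {
            if (oc.equals(fetch(id).get("objectClass"))) matches.add(id);
        }
        return matches;
    }

    // Option 2: check the OC reverse table first, grab only the survivors.
    List<Long> searchWithReverseIndex(String oc, String cn) {
        List<Long> matches = new ArrayList<>();
        for (long id : cnForward.getOrDefault(cn, Collections.<Long>emptySet())) {
            if (!ocReverse.getOrDefault(id, Collections.<String>emptySet()).contains(oc)) {
                continue; // rejected without touching the MT
            }
            fetch(id); // the entry is still needed to return it
            matches.add(id);
        }
        return matches;
    }
}
```

The second option saves one MT read per rejected candidate, but pays one reverse-table lookup per candidate, whether it is rejected or not.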


1) We grab all the entries using the smallest set selected with the filters
pros :

- Fast entry filtering, it's just a question of applying the filters on the entry

- It can be used very late, as we have already grabbed the entry. The entry selection can not only be based on the filter, but can also be combined with virtual attributes which have been added on the grabbed entry

- That means we can move the search engine out of the backend, and have it working on the top of the Interceptor chain


cons :
- We may read way more entries than necessary

2) we just get the Entry ID and use the reverse index to select the entries
pros :
- we don't grab the entries unless absolutely necessary

cons :

- we may have to check in many indexes (as many as we have exprNodes in the filter expression, assuming that each of them is indexed), which means a lot of O(log(N)) operations on these indexes (assuming that all those indexes are in memory, otherwise, if we hit the disk, the benefit is null)

- as soon as we have one single non indexed exprNode in the filter we have to check, then we will have to grab the entry anyway. The benefit is that we may save some disk access.

- of course, we have to maintain all the reverse indexes.

- if the exprNode is associated with an attribute not present in the entry, we do a lookup in the index for nothing

- if the intersection between the two exprNodes is big (i.e. Nb(ObjectClass=XXX) inter Nb(cn=YYY)), then the gain will be low, and if this intersection is small, then it's likely that the smallest set has already been selected as the main index to use to grab entries, thus leading to a small number of entries to grab.



So here is what I suggest :
- get rid of those reverse indexes
- or, at least, make them optional

thoughts ?

--
Regards,
Cordialement,
Emmanuel Lécharny
www.nextury.com
