Emmanuel Lécharny wrote:
Le 11/08/16 à 18:05, Alex Karasulu a écrit :
The other substring filter expressions without prefixes or suffixes would 
require an inhibitive full scan of the index: i.e. ‘*klm*’. Pointless to do and 
would clear cache memory with the churn. So your 10% of total size thingy if 
configurable by the administrator makes sense as a best guess before the 
optimizer goes to work.

There are other options, like the ones implemented by OpenLDAP, OpenDJ,
OpenDS : build indexes based on part of the values. For instance, let
say we have entry 1 : 'cn=A value', we can imagine indexing every 3
consecutive letters for this value. The index will then contain things
like :

'A v' -> entry 1
' va' -> entry 1
'val' -> entry 1
'alu' -> entry 1
'lue' -> entry 1

Now, searching for '*lue' will brings entry 1 immediately, as searching
for 'A v*', as searching for '*val*'

That comes with a cost : the index will be huge. Openldap and other
allows you to tune this index in many ways. Typically, Openldap has the
following configuration parameters to tune the index :
*index_substr_if_minlen, **index_substr_if_maxlen,
**index_substr_any_len, **index_substr_any_step

The XXX_step parameters is by default set to 2, which allows to split
the index by a factor 2, but will require an average of 1.5 lookups if
the entry is present, or 2 lookups if it's not present. With a bigger
step the index will be even smaller (so there is a gain in the index
size) but will require more lookups.

This also solves the *xyz problem : you don't need an extra revert index.

Note that it's not a perfect solution either : if the substring in the
filter is bigger than the size of the indexed splitted value, you are
more likely to get duplicates (and wrong ones). OTOH, it gives an
accurate number of candidate, compared to teh holistic approach we use...

The best solution does not exist. The only way to get an accurate count
would be to let the user to fine tune the index (or to have a smart
indexer, that evolves through the analysis of the search request being
done... Not likely to be implemented soon ;-)
*

Substring indexing is always a challenge, but I think trigram-based as we've implemented it is as near to optimal as exists. It has an additional weakness in that you can't use the index for shorter terms. E.g. we can't find "ab*" or "a*" with a 3-character substring index.

The other benefit from this approach is the space tradeoff, we know our index is a constant factor larger than the original string. With string-reverse indexing, your index size is n^2/2...

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Reply via email to