Alex Karasulu wrote:

On Aug 11, 2016, at 4:20 PM, Howard Chu <[email protected]
<mailto:[email protected]>> wrote:

Emmanuel Lécharny wrote:
Le 11/08/16 à 18:05, Alex Karasulu a écrit :
The other substring filter expressions without prefixes or suffixes would
require an inhibitive full scan of the index: i.e. ‘*klm*’. Pointless to
do and would clear cache memory with the churn. So your 10% of total size
thingy if configurable by the administrator makes sense as a best guess
before the optimizer goes to work.

There are other options, like the ones implemented by OpenLDAP, OpenDJ,
OpenDS : build indexes based on part of the values. For instance, let
say we have entry 1 : 'cn=A value', we can imagine indexing every 3
consecutive letters for this value. The index will then contain things
like :

'A v' -> entry 1
' va' -> entry 1
'val' -> entry 1
'alu' -> entry 1
'lue' -> entry 1

Now, searching for '*lue' will brings entry 1 immediately, as searching
for 'A v*', as searching for '*val*'

That comes with a cost : the index will be huge. Openldap and other
allows you to tune this index in many ways. Typically, Openldap has the
following configuration parameters to tune the index :
*index_substr_if_minlen, **index_substr_if_maxlen,
**index_substr_any_len, **index_substr_any_step

The XXX_step parameters is by default set to 2, which allows to split
the index by a factor 2, but will require an average of 1.5 lookups if
the entry is present, or 2 lookups if it's not present. With a bigger
step the index will be even smaller (so there is a gain in the index
size) but will require more lookups.

This also solves the *xyz problem : you don't need an extra revert index.

Note that it's not a perfect solution either : if the substring in the
filter is bigger than the size of the indexed splitted value, you are
more likely to get duplicates (and wrong ones). OTOH, it gives an
accurate number of candidate, compared to teh holistic approach we use...

The best solution does not exist. The only way to get an accurate count
would be to let the user to fine tune the index (or to have a smart
indexer, that evolves through the analysis of the search request being
done... Not likely to be implemented soon ;-)
*

Substring indexing is always a challenge, but I think trigram-based as we've
implemented it is as near to optimal as exists. It has an additional
weakness in that you can't use the index for shorter terms. E.g. we can't
find "ab*" or "a*" with a 3-character substring index.

The other benefit from this approach is the space tradeoff, we know our
index is a constant factor larger than the original string. With
string-reverse indexing, your index size is n^2/2…

Hi Howard !!! Hope all is well with you.

Heya Alex, good to see ya.

You got me curious. What’s the index space consumption order with the trigram
based approach?

It's basically N/index_substr_any_step * index_substr_any_len. In OpenLDAP the actual index is a hash, so it's a constant, and not dependent on the configured index_substr_any_len.

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

Reply via email to