On 11/08/16 at 18:05, Alex Karasulu wrote:
> Hi Em,
>
> The substring filter is a big PITA for sure. As you know the substring
> filters with a fixed prefix like the one you show here with ‘abc’ prefix is
> the best kind we can get for implementing a realistic scan count: much better
> than a ‘*xyz’. Maybe we can use tricks on the index if it exists for these
> classes of substring expressions. Like for example advancing to the first and
> last ‘abc’ prefix occurrence to figure out an accurate scan count.
>
> An approach that could be taken with the class of substring expressions with
> suffix terms (i.e. ‘*xyz’ ) is to use reverse string indices in addition to
> the current forward string index. This comes with a cost though of building
> and maintaining the index. However it would speed up most classes of
> substring expressions.
>
> The other substring filter expressions without prefixes or suffixes would
> require an inhibitive full scan of the index: i.e. ‘*klm*’. Pointless to do
> and would clear cache memory with the churn. So your 10% of total size thingy
> if configurable by the administrator makes sense as a best guess before the
> optimizer goes to work.
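For what it's worth, the "advance to the first and last 'abc' occurrence" idea quoted above can be sketched in a few lines. This is only an illustration, assuming the forward index behaves like a sorted list of string keys; the function name prefix_scan_count is made up for the example and is not an existing API.

```python
from bisect import bisect_left


def prefix_scan_count(sorted_keys, prefix):
    """Count the keys that start with `prefix` using two binary searches.

    Assumes `sorted_keys` is a sorted list of strings (a stand-in for a
    forward string index) and that the last character of `prefix` is not
    the highest possible code point.
    """
    if not prefix:
        # An empty prefix matches every key.
        return len(sorted_keys)
    # Position of the first key >= prefix.
    first = bisect_left(sorted_keys, prefix)
    # Smallest string that sorts after every key starting with `prefix`:
    # bump the last character by one ('abc' -> 'abd').
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    # Position just past the last key starting with `prefix`.
    last = bisect_left(sorted_keys, upper)
    return last - first
```

Both lookups are logarithmic in the number of keys, so the count comes without walking the matching range.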
There are other options, like the ones implemented by OpenLDAP, OpenDJ and OpenDS: build indexes based on parts of the values.

For instance, let's say we have entry 1: 'cn=A value'. We can imagine indexing every 3 consecutive characters of this value. The index will then contain things like:

  'A v' -> entry 1
  ' va' -> entry 1
  'val' -> entry 1
  'alu' -> entry 1
  'lue' -> entry 1

Now, searching for '*lue' will bring back entry 1 immediately, as will searching for 'A v*' or '*val*'.

That comes with a cost: the index will be huge. OpenLDAP and the others allow you to tune this index in many ways. Typically, OpenLDAP has the following configuration parameters to tune the index:

  index_substr_if_minlen
  index_substr_if_maxlen
  index_substr_any_len
  index_substr_any_step

The XXX_step parameter is set to 2 by default, which shrinks the index by a factor of 2, but will require an average of 1.5 lookups if the entry is present, or 2 lookups if it's not. With a bigger step the index will be even smaller (so there is a gain in index size), but more lookups will be required.

This also solves the '*xyz' problem: you don't need an extra reverse index.

Note that it's not a perfect solution either: if the substring in the filter is longer than the indexed slices, you are more likely to get duplicates (and wrong ones). OTOH, it gives an accurate number of candidates, compared to the holistic approach we use...

The best solution does not exist. The only way to get an accurate count would be to let the user fine-tune the index, or to have a smart indexer that evolves by analysing the search requests being done. Not likely to be implemented soon ;-)
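To make the 3-character slicing idea concrete, here is a minimal sketch. It is not how OpenLDAP or ApacheDS actually implement it: the names (ngrams, build_index, candidates, search) are hypothetical, there is no step parameter, and it treats every filter as "contains", so anchoring ('abc*' vs '*abc') is left to the final check against the real value.

```python
from collections import defaultdict

N = 3  # slice length, as in the 'cn=A value' example


def ngrams(value, n=N):
    """Return every run of n consecutive characters in `value`."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]


def build_index(entries, n=N):
    """Map each n-character slice to the set of entry ids containing it.

    `entries` is a dict {entry_id: attribute value}.
    """
    index = defaultdict(set)
    for entry_id, value in entries.items():
        for gram in ngrams(value, n):
            index[gram].add(entry_id)
    return index


def candidates(index, substring, n=N):
    """Intersect the id sets of all slices of `substring`.

    Returns None when the substring is shorter than n, because the index
    cannot help in that case. The result may contain false positives.
    """
    grams = ngrams(substring, n)
    if not grams:
        return None
    result = None
    for gram in grams:
        ids = index.get(gram, set())
        result = set(ids) if result is None else result & ids
    return result


def search(entries, index, substring):
    """Use the index to narrow the candidates, then verify each one."""
    cands = candidates(index, substring)
    if cands is None:
        cands = entries.keys()  # fall back to a full scan
    return sorted(e for e in cands if substring in entries[e])
```

With entries {1: 'A value', 2: 'alue val'}, candidates(index, 'value') returns both ids, because entry 2 contains 'val', 'alu' and 'lue' without containing 'value'. That is the "wrong ones" case: the verification pass in search() is what drops entry 2.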
