Re: Search engine performances...

Emmanuel Lécharny Thu, 11 Aug 2016 11:03:06 -0700

On 11/08/16 at 18:05, Alex Karasulu wrote:
> Hi Em,
>
> The substring filter is a big PITA for sure. As you know the substring 
> filters with a fixed prefix like the one you show here with ‘abc’ prefix is 
> the best kind we can get for implementing a realistic scan count: much better 
> than a ‘*xyz’. Maybe we can use tricks on the index if it exists for these 
> classes of substring expressions. Like for example advancing to the first and 
> last ‘abc’ prefix occurrence to figure out an accurate scan count.
>
> An approach that could be taken with the class of substring expressions with 
> suffix terms (i.e. ‘*xyz’ ) is to use reverse string indices in addition to 
> the current forward string index. This comes with a cost though of building 
> and maintaining the index. However it would speed up most classes of 
> substring expressions.  
>
> The other substring filter expressions without prefixes or suffixes would 
> require a prohibitive full scan of the index: i.e. ‘*klm*’. Pointless to do 
> and would clear cache memory with the churn. So your 10% of total size thingy 
> if configurable by the administrator makes sense as a best guess before the 
> optimizer goes to work.


There are other options, like the ones implemented by OpenLDAP, OpenDJ
and OpenDS: build indexes based on parts of the values. For instance,
say we have entry 1: 'cn=A value'. We can imagine indexing every run
of 3 consecutive characters of this value. The index will then contain
things like:

'A v' -> entry 1
' va' -> entry 1
'val' -> entry 1
'alu' -> entry 1
'lue' -> entry 1

Now, searching for '*lue' will bring up entry 1 immediately, as will
searching for 'A v*' or '*val*'.
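
To make the idea concrete, here is a minimal Python sketch of such an
index. The function names (ngrams, build_index, candidates) are made up
for illustration and do not come from any of the servers mentioned. The
result is only a candidate set: the anchoring of the filter ('*lue' vs
'*lue*') is ignored here and would still have to be checked against the
actual entry.

```python
from collections import defaultdict

def ngrams(value, n=3):
    """Return every run of n consecutive characters in value, in order."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]

def build_index(entries, n=3):
    """Map each n-gram to the set of entry ids whose value contains it."""
    index = defaultdict(set)
    for entry_id, value in entries.items():
        for gram in ngrams(value, n):
            index[gram].add(entry_id)
    return index

def candidates(index, fragment, n=3):
    """Candidate entry ids for a filter fragment of at least n characters."""
    grams = ngrams(fragment, n)
    if not grams:
        raise ValueError("fragment is shorter than the indexed n-gram size")
    result = None
    for gram in grams:
        ids = index.get(gram, set())
        # An entry must contain every n-gram of the fragment.
        result = set(ids) if result is None else result & ids
    return result

# Hypothetical data: entry 1 is the example from the text.
entries = {1: "A value", 2: "Another one"}
index = build_index(entries)

lue_hits = candidates(index, "lue")   # fragment of '*lue'
av_hits = candidates(index, "A v")    # fragment of 'A v*'
val_hits = candidates(index, "val")   # fragment of '*val*'
```

All three lookups hit a single index key and return entry 1 only.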

That comes with a cost: the index will be huge. OpenLDAP and others
allow you to tune this index in many ways. Typically, OpenLDAP has the
following configuration parameters to tune the index:
index_substr_if_minlen, index_substr_if_maxlen,
index_substr_any_len, index_substr_any_step

The XXX_step parameter is set to 2 by default, which shrinks the index
by a factor of 2, but requires an average of 1.5 lookups if the entry
is present, or 2 lookups if it's not. With a bigger step the index will
be even smaller (so there is a gain in index size), but more lookups
will be required.
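
The lookup counts above can be reproduced with a small Python sketch.
This follows the description in this mail, not the actual code of any
server, and the names (build_stepped_index, probe) are hypothetical.
It assumes only the n-grams starting at offsets 0, step, 2*step, ... of
each value are indexed, and that the filter fragment is at least
n + step - 1 characters long so that one of its n-grams is guaranteed
to line up with an indexed offset. probe() only tells whether at least
one candidate exists; collecting all candidates would mean trying every
offset and taking the union.

```python
from collections import defaultdict

def build_stepped_index(entries, n=3, step=2):
    """Index only the n-grams starting at offsets 0, step, 2*step, ..."""
    index = defaultdict(set)
    for entry_id, value in entries.items():
        for i in range(0, len(value) - n + 1, step):
            index[value[i:i + n]].add(entry_id)
    return index

def probe(index, fragment, n=3, step=2):
    """Return (found, lookups): whether any candidate exists, and how
    many index lookups were needed to find out."""
    if len(fragment) < n + step - 1:
        raise ValueError("fragment too short for this n-gram size and step")
    lookups = 0
    for offset in range(step):
        lookups += 1
        if index.get(fragment[offset:offset + n]):
            return True, lookups      # stop at the first hit
    return False, lookups             # every offset was tried

entries = {1: "A value"}              # hypothetical data
stepped = build_stepped_index(entries)

aligned = probe(stepped, "valu")      # 'val' starts at an indexed offset
shifted = probe(stepped, " val")      # ' va' is not indexed, 'val' is
absent = probe(stepped, "wxyz")       # nothing matches
```

With step 2 the index holds 3 keys for 'A value' instead of 5. A
fragment that is present costs 1 or 2 lookups depending on alignment
(1.5 on average), and an absent one always costs 2.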

This also solves the *xyz problem: you don't need an extra reverse index.

Note that it's not a perfect solution either: if the substring in the
filter is longer than the indexed fragments, you are more likely to
get duplicates (and false positives). OTOH, it gives an accurate
number of candidates, compared to the heuristic approach we use...
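
A short Python sketch of the false positive case, under the same
assumptions as before (3-character fragments, hypothetical function
names). For brevity it recomputes the n-grams of each value instead of
building a real index, which gives the same candidate set.

```python
def ngrams(text, n=3):
    """Set of all runs of n consecutive characters in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def index_candidates(entries, fragment, n=3):
    """Entries whose n-grams include every n-gram of the fragment."""
    wanted = ngrams(fragment, n)
    return {eid for eid, value in entries.items()
            if wanted <= ngrams(value, n)}

def verified(entries, fragment, n=3):
    """Drop the candidates that do not really contain the fragment."""
    return {eid for eid in index_candidates(entries, fragment, n)
            if fragment in entries[eid]}

# Hypothetical data: entry 2 contains 'abc' and 'bcd', but not 'abcd'.
entries = {1: "abcd", 2: "abc bcd"}

from_index = index_candidates(entries, "abcd")   # both entries
after_check = verified(entries, "abcd")          # only entry 1
```

The filter fragment 'abcd' is longer than the indexed size, so it is
split into 'abc' and 'bcd'. Entry 2 contains both pieces separately
and shows up as a candidate even though it does not match.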

There is no best solution. The only way to get an accurate count
would be to let the user fine-tune the index, or to have a smart
indexer that evolves by analyzing the search requests being
made. Not likely to be implemented soon ;-)