Re: Search engine performances...

Alex Karasulu Thu, 11 Aug 2016 12:11:29 -0700

> On Aug 11, 2016, at 2:02 PM, Emmanuel Lécharny <[email protected]> wrote:
> 
> Le 11/08/16 à 18:05, Alex Karasulu a écrit :
>> Hi Em,
>> 
>> The substring filter is a big PITA for sure. As you know the substring 
>> filters with a fixed prefix like the one you show here with ‘abc’ prefix is 
>> the best kind we can get for implementing a realistic scan count: much 
>> better than a ‘*xyz’. Maybe we can use tricks on the index if it exists for 
>> these classes of substring expressions. Like for example advancing to the 
>> first and last ‘abc’ prefix occurrence to figure out an accurate scan count.
>> 
>> An approach that could be taken with the class of substring expressions with 
>> suffix terms (i.e. ‘*xyz’ ) is to use reverse string indices in addition to 
>> the current forward string index. This comes with a cost though of building 
>> and maintaining the index. However it would speed up most classes of 
>> substring expressions.  
>> 
>> The other substring filter expressions without prefixes or suffixes would 
>> require an inhibitive full scan of the index: i.e. ‘*klm*’. Pointless to do 
>> and would clear cache memory with the churn. So your 10% of total size 
>> thingy if configurable by the administrator makes sense as a best guess 
>> before the optimizer goes to work.
> 
> There are other options, like the ones implemented by OpenLDAP, OpenDJ,
> OpenDS : build indexes based on part of the values. For instance, let
> say we have entry 1 : 'cn=A value', we can imagine indexing every 3
> consecutive letters for this value. The index will then contain things
> like :
> 
> 'A v' -> entry 1
> ' va' -> entry 1
> 'val' -> entry 1
> 'alu' -> entry 1
> 'lue' -> entry 1
> 
> Now, searching for '*lue' will brings entry 1 immediately, as searching
> for 'A v*', as searching for '*val*'
> 
> That comes with a cost : the index will be huge.


Yeah I had seen these “create every substring permutation indices” in other 
LDAP servers. I just turned around, said "Oye vey," and walked the other away. 

> Openldap and other
> allows you to tune this index in many ways. Typically, Openldap has the
> following configuration parameters to tune the index :
> *index_substr_if_minlen, **index_substr_if_maxlen,
> **index_substr_any_len, **index_substr_any_step
> 

Yup lots of management overhead, too many one offs, and lots of index bloat 
potential IMHO. Plus this can be a long running task drawing out the life of a 
transactional update if we have to generate each permutation on substring index 
updates.

> The XXX_step parameters is by default set to 2, which allows to split
> the index by a factor 2, but will require an average of 1.5 lookups if
> the entry is present, or 2 lookups if it's not present. With a bigger
> step the index will be even smaller (so there is a gain in the index
> size) but will require more lookups.
> 
> This also solves the *xyz problem : you don't need an extra revert index.
> 

Yeah it does but it also increases complexity. However if folks want to deal 
with that my hats off to them. 

The reverse index only seems simple and easy. It deals with another major class 
of substring expressions having suffixes. I have a feeling most substring 
expressions are going to have prefixes and/or suffixes. Even if middle terms 
are present scan counts on suffix or prefix terms are enough to get a correct 
scan count. If both are present the prefix and suffix are like AND'd 
expressions where each term shrinks the search space further.

We can throw our hands up in the air when users decide to not include a prefix 
or suffix: i.e. STAR MIDDLE_TERMS STAR. The server can just take an educated 
guess. This seems like the most sane tradeoff.

> Note that it's not a perfect solution either : if the substring in the
> filter is bigger than the size of the indexed splitted value, you are
> more likely to get duplicates (and wrong ones). OTOH, it gives an
> accurate number of candidate, compared to teh holistic approach we use...
> 
> The best solution does not exist. The only way to get an accurate count
> would be to let the user to fine tune the index (or to have a smart
> indexer, that evolves through the analysis of the search request being
> done... Not likely to be implemented soon ;-)
> *

Right.

Best,
Alex

Re: Search engine performances...

Reply via email to