On 2/2/2017 8:15 AM, Maciej Ł. PCSS wrote:
> regardless of the value of such a use case, there is another thing
> that remains unclear to me.
>
> Does SOLR support a simple and silly 'exact substring match'? I mean,
> is it possible to search for (actually filter by) a raw substring
> without tokenization and without any kind of processing/simplifying
> the searched information? By a 'raw substring' I mean a character
> string that, among others, can contain non-letters (colons, brackets,
> etc.) - basically everything the user is able to input via keyboard.
>
> Is this use case within Solr's technical capabilities, even if that
> means a big efficiency cost?

Because you want to do substring matches, things are somewhat more
complicated than if you wanted to do a full exact-string-only query.

First I'll tackle the full exact query idea, because the info is also
important for substrings:

If the class in the fieldType is "solr.StrField", then the input will
be indexed exactly as it is sent: every character is preserved, and
every character must be present in the query for it to match.
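
For illustration (the field and type names here are just examples, not
anything from your schema), a string field definition might look like
this in schema.xml:

  <fieldType name="string_exact" class="solr.StrField"
             sortMissingLast="true"/>
  <field name="raw_value" type="string_exact" indexed="true"
         stored="true"/>

With a definition like that, a query on raw_value only matches
documents whose entire stored value is identical to the (escaped)
query string.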

On the query side, you would need to escape any special characters in
the query string -- spaces, colons, and several other characters. 
Escaping is done with the backslash.  If you are manually constructing
URL parameters for an HTTP request, you would also need to be aware of
URL encoding.  Some Solr libraries (like SolrJ) are capable of handling
all the URL encoding for you.
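
As a rough example, again using the made-up field name raw_value: if
the indexed value is foo (bar):baz, the query would have to look
something like this:

  Escaped for the query parser:    raw_value:foo\ \(bar\)\:baz
  URL-encoded in the q parameter:  q=raw_value%3Afoo%5C%20%5C%28bar%5C%29%5C%3Abaz

With SolrJ, ClientUtils.escapeQueryChars() will do the backslash
escaping for you, and the client takes care of the URL encoding.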

Matching *substrings* with StrField would involve either a regular
expression query (with .* before and after) or a wildcard query, which
Erick described in his reply.
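
With the same made-up field name, those two query types look roughly
like this (the substring itself still needs the escaping described
above, and both forms can be slow, since the leading wildcard or .*
forces a scan of the term list):

  Wildcard:            raw_value:*bar*
  Regular expression:  raw_value:/.*bar.*/

Note that inside the /.../ form, the characters that need escaping
follow Lucene's regexp syntax rather than the query parser escaping
listed below.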

An alternate way to do substring matches, without wildcards or regex,
is to use the NGram or EdgeNGram filters.  This method will increase
your index size, possibly by a large amount.  To use it, you'd need to
switch back to solr.TextField, use the keyword tokenizer, and then
follow that with the appropriate NGram filter.  Depending on your
exact needs, you might only apply the NGram filter during index
analysis, or you might need it in both index and query analysis.
Escaping special characters on the query side would still be required.
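
A minimal sketch of such a fieldType (the name and the gram sizes are
only examples -- tune them to the lengths your users will search for):

  <fieldType name="substring_match" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2"
              maxGramSize="20"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>

With index-side grams only, keep in mind that a query string longer
than maxGramSize cannot match, because no gram of that length was
indexed.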

The full list of characters that require escaping is at the end of this
page:

http://lucene.apache.org/core/6_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters

Note that the list shows && and || as special characters, even though
each of those is two characters.  In practice, even a single & or |
typically requires escaping.  Solr will also need spaces to be
escaped.
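
For instance, a user-typed value of black && white (just an
illustrative string) would be sent to the query parser as:

  black\ \&\&\ white

and a lone & or | is escaped the same way, as \& or \|.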

Thanks,
Shawn
