Well, the *whatever* syntax will work
(that's asterisk-whatever-asterisk, in case funky bolding happens). You'd
use it on a "string" field (unanalyzed, case-sensitive) or perhaps on
a field with KeywordTokenizerFactory, possibly followed by
LowerCaseFilterFactory if you want case-insensitive matches. I think
you have to enable leading wildcards as well.
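For reference, a schema sketch along those lines (the field and type
names here are made up, and exact details can vary by Solr version):

```xml
<!-- One token per field value: the whole string, lowercased. -->
<fieldType name="keyword_lower" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using it; often populated via copyField. -->
<field name="raw_s" type="keyword_lower" indexed="true" stored="true"/>
```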

There's some trickiness getting all this past the query _parser_
at query time, though, and URL-encoding the odd characters may be
required. There's been some recent work done to get spaces through
the query parsing step, but in any case you can escape the spaces with
a backslash.
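As a minimal sketch of that escaping step in Python (the field name
raw_s is made up, and the escape set is an assumption drawn from the
classic Lucene query syntax; adjust both for your schema and parser):

```python
from urllib.parse import quote

# Characters the classic Lucene/Solr query parser treats specially,
# plus the space character. (Assumed escape set; verify against the
# query parser you actually use.)
SPECIAL = set('+-&|!(){}[]^"~*?:\\/ ')

def escape_solr_term(term):
    """Backslash-escape special characters so the raw substring
    survives query parsing."""
    return ''.join('\\' + c if c in SPECIAL else c for c in term)

# Build a leading/trailing wildcard query against a hypothetical
# keyword-tokenized field named "raw_s".
substring = 'c def g'
q = 'raw_s:*' + escape_solr_term(substring) + '*'
print(q)             # raw_s:*c\ def\ g*
print('q=' + quote(q))  # URL-encoded for use in a request URL
```

Whether you also need the URL-encoding step depends on how you send
the request; most Solr client libraries handle it for you.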

So yes, it's technically possible, but query times will be poor with
lots of data. Whether that's "good enough" is
application-dependent....

Best,
Erick

On Thu, Feb 2, 2017 at 7:15 AM, Maciej Ł. PCSS <labed...@man.poznan.pl> wrote:
> Hi Erick, All,
>
> regardless of the value of such a use-case, there is another thing that
> stays unknown for me.
>
> Does SOLR support a simple and silly 'exact substring match'? I mean, is it
> possible to search for (actually filter by) a raw substring without
> tokenization and without any kind of processing/simplifying the searched
> information? By a 'raw substring' I mean a character string that, among
> others, can contain non-letters (colons, brackets, etc.) - basically
> everything the user is able to input via keyboard.
>
> Does this use case meet SOLR technical possibilities even if that means a
> big efficiency cost?
>
> Regards
> Maciej
>
>
> W dniu 30.01.2017 o 17:12, Erick Erickson pisze:
>>
>> Well, the usual Solr solution to leading and trailing wildcards is to
>> ngram the field. You can get the entire field (including spaces) to be
>> analyzed as a whole by using KeywordTokenizer. Sometimes you wind up
>> using a copyField to support this and search against one or the other
>> if necessary.
>>
>> You can do this with KeywordTokenizer and '*a bcd ef*', but that'll be
>> slow for the exact same reason the SQL query is slow: it has to
>> examine every value in every document to find terms that match, then
>> search on those.
>>
>> There's some index size cost here so you'll have to test.
>>
>> Really go back to your use-case to see if this is _really_ necessary
>> though. Often people think it is because that's the only way they've
>> been able to search at all in SQL and it can turn out that there are
>> other ways to solve it. IOW, this could be an XY problem.
>>
>> Best,
>> Erick
>>
>> On Mon, Jan 30, 2017 at 1:52 AM, Maciej Ł. PCSS <labed...@man.poznan.pl>
>> wrote:
>>>
>>> Hi All,
>>>
>>> What solution have you applied in your implementations?
>>>
>>> Regards
>>> Maciej
>>>
>>>
>>> W dniu 24.01.2017 o 14:10, Maciej Ł. PCSS pisze:
>>>>
>>>> Dear SOLR users,
>>>>
>>>> please point me to the right solution of my problem. I'm using SOLR to
>>>> implement a Google-like search in my application and this scenario is
>>>> working fine.
>>>>
>>>> However, in specific use-cases I need to filter documents that include a
>>>> specific substring in a given field. It's about an SQL-like query
>>>> similar to
>>>> this:
>>>>
>>>> SELECT * FROM table WHERE someField LIKE '%c def g%'
>>>>
>>>> I expect to match documents having someField = 'abc def ghi'. That means
>>>> I expect to match parts of words.
>>>>
>>>> As I understand, SOLR, being a reverse index, works with tokens rather
>>>> than character strings and thereby looks for whole words (not
>>>> substrings).
>>>>
>>>> Is there any solution for such an issue?
>>>>
>>>> Regards
>>>> Maciej Łabędzki
>>>
>>>
>
