Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-05-31 Thread Erick Erickson
Your searches against the ascii_ignorecase_string field will suffer
performance-wise: SQL-like %whatever% queries essentially have to do a
table scan and (conceptually) assemble a huge OR clause consisting of
all the terms (in this case strings) that match.

Shawn's comment on using NGrams is the way this is usually done.

Best,
Erick

On Wed, May 31, 2017 at 12:28 AM, Maciej Ł. PCSS  wrote:
> Shawn, thank you for your response.
>
> Finally, my search is based on two kinds of fields (strings and text, both
> ignoring case and special characters) that potentially can contain any
> language but mainly Polish or English. This is because the two main
> requirements were:
> 1) Google-like search for quick lookups,
> 2) Precise multi-criteria search.
>
> For the first option we use the "ascii_ignorecase_text" field type (below).
> For the second case we applied the "ascii_ignorecase_string". It is very often
> that the customer knows only part of an identifier / sample name / user's
> surname / address, and still the customer wants to search by that partial
> information. The application is about exploring a scientific database of
> biological samples, each of them having lots of attributes.
>
> Considering the above I'm fine with the following type definitions:
>
> [The schema XML for the "ascii_ignorecase_text" and "ascii_ignorecase_string" field types was stripped by the mailing list archive; only the positionIncrementGap="100" attributes survived.]
>
> Thank you for your help!
>
> Regards
> Maciej Łabędzki
>
>
> W dniu 02.02.2017 o 16:55, Shawn Heisey pisze:
>>
>> On 2/2/2017 8:15 AM, Maciej Ł. PCSS wrote:
>>>
>>> regardless of the value of such a use-case, there is another thing
>>> that remains unclear to me.
>>>
>>> Does SOLR support a simple and silly 'exact substring match'? I mean,
>>> is it possible to search for (actually filter by) a raw substring
>>> without tokenization and without any kind of processing/simplifying
>>> the searched information? By a 'raw substring' I mean a character
>>> string that, among others, can contain non-letters (colons, brackets,
>>> etc.) - basically everything the user is able to input via keyboard.
>>>
>>> Is this use case within SOLR's technical possibilities, even if that
>>> means a big efficiency cost?
>>
>> Because you want to do substring matches, things are somewhat more
>> complicated than if you wanted to do a full exact-string-only query.
>>
>> First I'll tackle the full exact query idea, because the info is also
>> important for substrings:
>>
>> If the class in the fieldType is "solr.StrField", then the input will be
>> indexed exactly as it is sent, with all characters preserved, and all of
>> those characters will need to be present in the query.
>>
>> On the query side, you would need to escape any special characters in
>> the query string -- spaces, colons, and several other characters.
>> Escaping is done with the backslash.  If you are manually constructing
>> URL parameters for an HTTP request, you would also need to be aware of
>> URL encoding.  Some Solr libraries (like SolrJ) are capable of handling
>> all the URL encoding for you.
>>
>> Matching *substrings* with StrField would involve either a regular
>> expression query (with .* before and after) or a wildcard query, which
>> Erick described in his reply.
>>
>> An alternate way to do substring matches is the NGram or EdgeNGram
>> filters, and not using wildcards or regex.  This method will increase
>> your index size, possibly by a large amount.  To use this method, you'd
>> need to switch back to solr.TextField, use the keyword tokenizer, and
>> then follow that with the appropriate NGram filter.  Depending on your
>> exact needs, you might only do the NGram filter on the index side, or
>> you might need it on both index and query analysis.  Escaping special
>> characters on the query side would still be required.
>>
>> The full list of characters that require escaping is at the end of this
>> page:
>>
>>
>> http://lucene.apache.org/core/6_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters
>>
>> Note that it shows && and || as special characters, even though these
>> are in fact two characters each.  Typically even a single instance of
>> these characters requires escaping.  Solr will also need spaces to be
>> escaped.
>>
>> Thanks,
>> Shawn
>
>


Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-05-31 Thread Maciej Ł. PCSS

Shawn, thank you for your response.

Finally, my search is based on two kinds of fields (strings and text, 
both ignoring case and special characters) that potentially can contain 
any language but mainly Polish or English. This is because the two main 
requirements were:

1) Google-like search for quick lookups,
2) Precise multi-criteria search.

For the first option we use the "ascii_ignorecase_text" field type 
(below). For the second case we applied the "ascii_ignorecase_string". It is
very often that the customer knows only part of an identifier / sample 
name / user's surname / address, and still the customer wants to search 
by that partial information. The application is about exploring a 
scientific database of biological samples, each of them having lots of 
attributes.


Considering the above I'm fine with the following type definitions:

[The schema XML for the "ascii_ignorecase_text" and "ascii_ignorecase_string" field types was stripped by the mailing list archive; only the positionIncrementGap="100" attributes survived.]
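Since the archive stripped the original XML, here is a plausible reconstruction of the two field types. The tokenizer and filter choices are assumptions inferred from the descriptions in this thread ("ignoring case and special characters", keyword tokenization for the string variant), not the author's actual schema:

```xml
<!-- Hypothetical reconstruction: the original definitions were lost.
     Tokenized text, lowercased, diacritics folded to plain ASCII. -->
<fieldType name="ascii_ignorecase_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Whole value kept as a single token, so wildcard/substring queries
     see the full string, while matching stays case- and accent-insensitive. -->
<fieldType name="ascii_ignorecase_string" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```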

Thank you for your help!

Regards
Maciej Łabędzki


W dniu 02.02.2017 o 16:55, Shawn Heisey pisze:

On 2/2/2017 8:15 AM, Maciej Ł. PCSS wrote:

regardless of the value of such a use-case, there is another thing
that remains unclear to me.

Does SOLR support a simple and silly 'exact substring match'? I mean,
is it possible to search for (actually filter by) a raw substring
without tokenization and without any kind of processing/simplifying
the searched information? By a 'raw substring' I mean a character
string that, among others, can contain non-letters (colons, brackets,
etc.) - basically everything the user is able to input via keyboard.

Is this use case within SOLR's technical possibilities, even if that
means a big efficiency cost?

Because you want to do substring matches, things are somewhat more
complicated than if you wanted to do a full exact-string-only query.

First I'll tackle the full exact query idea, because the info is also
important for substrings:

If the class in the fieldType is "solr.StrField", then the input will be
indexed exactly as it is sent, with all characters preserved, and all of
those characters will need to be present in the query.

On the query side, you would need to escape any special characters in
the query string -- spaces, colons, and several other characters.
Escaping is done with the backslash.  If you are manually constructing
URL parameters for an HTTP request, you would also need to be aware of
URL encoding.  Some Solr libraries (like SolrJ) are capable of handling
all the URL encoding for you.

Matching *substrings* with StrField would involve either a regular
expression query (with .* before and after) or a wildcard query, which
Erick described in his reply.

An alternate way to do substring matches is the NGram or EdgeNGram
filters, and not using wildcards or regex.  This method will increase
your index size, possibly by a large amount.  To use this method, you'd
need to switch back to solr.TextField, use the keyword tokenizer, and
then follow that with the appropriate NGram filter.  Depending on your
exact needs, you might only do the NGram filter on the index side, or
you might need it on both index and query analysis.  Escaping special
characters on the query side would still be required.

The full list of characters that require escaping is at the end of this
page:

http://lucene.apache.org/core/6_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters

Note that it shows && and || as special characters, even though these
are in fact two characters each.  Typically even a single instance of
these characters requires escaping.  Solr will also need spaces to be
escaped.

Thanks,
Shawn




Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-02-02 Thread Mikhail Khludnev
Has anybody tried to tweak AnalyzingSuggester with an ngram token filter to
expand such infix queries?

On Thu, Feb 2, 2017 at 6:55 PM, Shawn Heisey  wrote:

> On 2/2/2017 8:15 AM, Maciej Ł. PCSS wrote:
> > regardless of the value of such a use-case, there is another thing
> > that remains unclear to me.
> >
> > Does SOLR support a simple and silly 'exact substring match'? I mean,
> > is it possible to search for (actually filter by) a raw substring
> > without tokenization and without any kind of processing/simplifying
> > the searched information? By a 'raw substring' I mean a character
> > string that, among others, can contain non-letters (colons, brackets,
> > etc.) - basically everything the user is able to input via keyboard.
> >
> > Is this use case within SOLR's technical possibilities, even if that
> > means a big efficiency cost?
>
> Because you want to do substring matches, things are somewhat more
> complicated than if you wanted to do a full exact-string-only query.
>
> First I'll tackle the full exact query idea, because the info is also
> important for substrings:
>
> If the class in the fieldType is "solr.StrField", then the input will be
> indexed exactly as it is sent, with all characters preserved, and all of
> those characters will need to be present in the query.
>
> On the query side, you would need to escape any special characters in
> the query string -- spaces, colons, and several other characters.
> Escaping is done with the backslash.  If you are manually constructing
> URL parameters for an HTTP request, you would also need to be aware of
> URL encoding.  Some Solr libraries (like SolrJ) are capable of handling
> all the URL encoding for you.
>
> Matching *substrings* with StrField would involve either a regular
> expression query (with .* before and after) or a wildcard query, which
> Erick described in his reply.
>
> An alternate way to do substring matches is the NGram or EdgeNGram
> filters, and not using wildcards or regex.  This method will increase
> your index size, possibly by a large amount.  To use this method, you'd
> need to switch back to solr.TextField, use the keyword tokenizer, and
> then follow that with the appropriate NGram filter.  Depending on your
> exact needs, you might only do the NGram filter on the index side, or
> you might need it on both index and query analysis.  Escaping special
> characters on the query side would still be required.
>
> The full list of characters that require escaping is at the end of this
> page:
>
> http://lucene.apache.org/core/6_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters
>
> Note that it shows && and || as special characters, even though these
> are in fact two characters each.  Typically even a single instance of
> these characters requires escaping.  Solr will also need spaces to be
> escaped.
>
> Thanks,
> Shawn
>
>


-- 
Sincerely yours
Mikhail Khludnev


Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-02-02 Thread Shawn Heisey
On 2/2/2017 8:15 AM, Maciej Ł. PCSS wrote:
> regardless of the value of such a use-case, there is another thing
> that remains unclear to me.
>
> Does SOLR support a simple and silly 'exact substring match'? I mean,
> is it possible to search for (actually filter by) a raw substring
> without tokenization and without any kind of processing/simplifying
> the searched information? By a 'raw substring' I mean a character
> string that, among others, can contain non-letters (colons, brackets,
> etc.) - basically everything the user is able to input via keyboard.
>
> Is this use case within SOLR's technical possibilities, even if that
> means a big efficiency cost?

Because you want to do substring matches, things are somewhat more
complicated than if you wanted to do a full exact-string-only query.

First I'll tackle the full exact query idea, because the info is also
important for substrings:

If the class in the fieldType is "solr.StrField", then the input will be
indexed exactly as it is sent, with all characters preserved, and all of
those characters will need to be present in the query.

On the query side, you would need to escape any special characters in
the query string -- spaces, colons, and several other characters. 
Escaping is done with the backslash.  If you are manually constructing
URL parameters for an HTTP request, you would also need to be aware of
URL encoding.  Some Solr libraries (like SolrJ) are capable of handling
all the URL encoding for you.
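As an illustration (not from the thread), here is a small helper that backslash-escapes query-parser special characters before the string goes into a Solr query. The exact character set is an assumption based on the Lucene classic query parser documentation, plus the space that Solr also needs escaped:

```python
# Characters special to the Lucene classic query parser (assumed set),
# plus space, which Solr also requires to be escaped.
SPECIAL = set('+-&|!(){}[]^"~*?:\\/ ')

def escape_query(value: str) -> str:
    """Backslash-escape query-parser special characters in a raw user string."""
    return ''.join('\\' + ch if ch in SPECIAL else ch for ch in value)

print(escape_query('abc def:ghi'))  # abc\ def\:ghi
```

Note this handles only query-parser escaping; URL encoding for the HTTP request is a separate step (or left to a client library such as SolrJ).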

Matching *substrings* with StrField would involve either a regular
expression query (with .* before and after) or a wildcard query, which
Erick described in his reply.

An alternate way to do substring matches is the NGram or EdgeNGram
filters, and not using wildcards or regex.  This method will increase
your index size, possibly by a large amount.  To use this method, you'd
need to switch back to solr.TextField, use the keyword tokenizer, and
then follow that with the appropriate NGram filter.  Depending on your
exact needs, you might only do the NGram filter on the index side, or
you might need it on both index and query analysis.  Escaping special
characters on the query side would still be required.
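A sketch of what such a field type might look like; the field type name and the ngram sizes are illustrative choices, not taken from the thread:

```xml
<!-- Hypothetical example: the keyword tokenizer keeps the whole value as
     one token; the ngram filter then indexes every substring of 2-15 chars.
     Ngrams are applied on the index side only; the query side keeps the
     whole (lowercased) search string so it matches one indexed gram. -->
<fieldType name="substring_match" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```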

The full list of characters that require escaping is at the end of this
page:

http://lucene.apache.org/core/6_4_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true#Escaping_Special_Characters

Note that it shows && and || as special characters, even though these
are in fact two characters each.  Typically even a single instance of
these characters requires escaping.  Solr will also need spaces to be
escaped.

Thanks,
Shawn



Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-02-02 Thread Erick Erickson
Well, the *whatever* syntax will work.
(that's asterisk-whatever-asterisk if funky bolding happens). You'd
use it on a "string" field (unanalyzed, case sensitive) or perhaps on
some field with KeywordTokenizerFactory possibly followed by
LowerCaseFilterFactory if you wanted case-insensitive matches. I think
you have to enable leading wildcards as well.

There's some trickiness getting all this past the query _parser_
at query time, though, and URL-encoding the odd characters may be
required. There's been some recent work done to get spaces through
the query parsing step, but in any case you can escape the spaces with
a backslash.

So yes, it's technically possible. Query times will be poor with lots
of data, though. Whether that's "good enough" is application-dependent.

Best,
Erick

On Thu, Feb 2, 2017 at 7:15 AM, Maciej Ł. PCSS  wrote:
> Hi Erick, All,
>
> regardless of the value of such a use-case, there is another thing that
> remains unclear to me.
>
> Does SOLR support a simple and silly 'exact substring match'? I mean, is it
> possible to search for (actually filter by) a raw substring without
> tokenization and without any kind of processing/simplifying the searched
> information? By a 'raw substring' I mean a character string that, among
> others, can contain non-letters (colons, brackets, etc.) - basically
> everything the user is able to input via keyboard.
>
> Is this use case within SOLR's technical possibilities, even if that means
> a big efficiency cost?
>
> Regards
> Maciej
>
>
> W dniu 30.01.2017 o 17:12, Erick Erickson pisze:
>>
>> Well, the usual Solr solution to leading and trailing wildcards is to
>> ngram the field. You can get the entire field (including spaces) to be
>> analyzed as a whole by using KeywordTokenizer. Sometimes you wind up
>> using a copyField to support this and search against one or the other
>> if necessary.
>>
>> You can do this with KeywordTokenizer and '*a bcd ef*', but that'll be
>> slow for the exact same reason the SQL query is slow: it has to
>> examine every value in every document to find terms that match, then
>> search on those.
>>
>> There's some index size cost here so you'll have to test.
>>
>> Really go back to your use-case to see if this is _really_ necessary
>> though. Often people think it is because that's the only way they've
>> been able to search at all in SQL and it can turn out that there are
>> other ways to solve it. IOW, this could be an XY problem.
>>
>> Best,
>> Erick
>>
>> On Mon, Jan 30, 2017 at 1:52 AM, Maciej Ł. PCSS 
>> wrote:
>>>
>>> Hi All,
>>>
>>> What solution have you applied in your implementations?
>>>
>>> Regards
>>> Maciej
>>>
>>>
>>> W dniu 24.01.2017 o 14:10, Maciej Ł. PCSS pisze:

 Dear SOLR users,

 please point me to the right solution of my problem. I'm using SOLR to
 implement a Google-like search in my application and this scenario is
 working fine.

 However, in specific use-cases I need to filter documents that include a
 specific substring in a given field. It's about an SQL-like query
 similar to
 this:

 SELECT * FROM table WHERE someField LIKE '%c def g%'

 I expect to match documents having someField = 'abc def ghi'. That means
 I expect to match parts of words.

 As I understand, SOLR, as a reverse index, works with tokens rather
 than character strings and thereby looks for whole words (not
 substrings).

 Is there any solution for such an issue?

 Regards
 Maciej Łabędzki
>>>
>>>
>


Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-02-02 Thread Maciej Ł. PCSS

Hi Erick, All,

regardless of the value of such a use-case, there is another thing that
remains unclear to me.


Does SOLR support a simple and silly 'exact substring match'? I mean, is 
it possible to search for (actually filter by) a raw substring without 
tokenization and without any kind of processing/simplifying the searched 
information? By a 'raw substring' I mean a character string that, among 
others, can contain non-letters (colons, brackets, etc.) - basically 
everything the user is able to input via keyboard.


Is this use case within SOLR's technical possibilities, even if that means
a big efficiency cost?


Regards
Maciej


W dniu 30.01.2017 o 17:12, Erick Erickson pisze:

Well, the usual Solr solution to leading and trailing wildcards is to
ngram the field. You can get the entire field (including spaces) to be
analyzed as a whole by using KeywordTokenizer. Sometimes you wind up
using a copyField to support this and search against one or the other
if necessary.

You can do this with KeywordTokenizer and '*a bcd ef*', but that'll be
slow for the exact same reason the SQL query is slow: it has to
examine every value in every document to find terms that match, then
search on those.

There's some index size cost here so you'll have to test.

Really go back to your use-case to see if this is _really_ necessary
though. Often people think it is because that's the only way they've
been able to search at all in SQL and it can turn out that there are
other ways to solve it. IOW, this could be an XY problem.

Best,
Erick

On Mon, Jan 30, 2017 at 1:52 AM, Maciej Ł. PCSS  wrote:

Hi All,

What solution have you applied in your implementations?

Regards
Maciej


W dniu 24.01.2017 o 14:10, Maciej Ł. PCSS pisze:

Dear SOLR users,

please point me to the right solution of my problem. I'm using SOLR to
implement a Google-like search in my application and this scenario is
working fine.

However, in specific use-cases I need to filter documents that include a
specific substring in a given field. It's about an SQL-like query similar to
this:

SELECT * FROM table WHERE someField LIKE '%c def g%'

I expect to match documents having someField = 'abc def ghi'. That means I
expect to match parts of words.

As I understand, SOLR, as a reverse index, works with tokens rather
than character strings and thereby looks for whole words (not substrings).

Is there any solution for such an issue?

Regards
Maciej Łabędzki






Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-01-31 Thread Maciej Ł. PCSS

Thank you Erick. Yes, I'm still thinking about that use case.

Regards
Maciej


W dniu 30.01.2017 o 17:12, Erick Erickson pisze:

Well, the usual Solr solution to leading and trailing wildcards is to
ngram the field. You can get the entire field (including spaces) to be
analyzed as a whole by using KeywordTokenizer. Sometimes you wind up
using a copyField to support this and search against one or the other
if necessary.

You can do this with KeywordTokenizer and '*a bcd ef*', but that'll be
slow for the exact same reason the SQL query is slow: it has to
examine every value in every document to find terms that match, then
search on those.

There's some index size cost here so you'll have to test.

Really go back to your use-case to see if this is _really_ necessary
though. Often people think it is because that's the only way they've
been able to search at all in SQL and it can turn out that there are
other ways to solve it. IOW, this could be an XY problem.

Best,
Erick

On Mon, Jan 30, 2017 at 1:52 AM, Maciej Ł. PCSS  wrote:

Hi All,

What solution have you applied in your implementations?

Regards
Maciej


W dniu 24.01.2017 o 14:10, Maciej Ł. PCSS pisze:

Dear SOLR users,

please point me to the right solution of my problem. I'm using SOLR to
implement a Google-like search in my application and this scenario is
working fine.

However, in specific use-cases I need to filter documents that include a
specific substring in a given field. It's about an SQL-like query similar to
this:

SELECT * FROM table WHERE someField LIKE '%c def g%'

I expect to match documents having someField = 'abc def ghi'. That means I
expect to match parts of words.

As I understand, SOLR, as a reverse index, works with tokens rather
than character strings and thereby looks for whole words (not substrings).

Is there any solution for such an issue?

Regards
Maciej Łabędzki






Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-01-30 Thread Erick Erickson
Well, the usual Solr solution to leading and trailing wildcards is to
ngram the field. You can get the entire field (including spaces) to be
analyzed as a whole by using KeywordTokenizer. Sometimes you wind up
using a copyField to support this and search against one or the other
if necessary.

You can do this with KeywordTokenizer and '*a bcd ef*', but that'll be
slow for the exact same reason the SQL query is slow: it has to
examine every value in every document to find terms that match, then
search on those.

There's some index size cost here so you'll have to test.
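To see why ngramming the field turns a substring search into a plain term lookup, here is a toy illustration in pure Python (not Solr's API; gram sizes are arbitrary):

```python
def ngrams(text: str, min_n: int = 2, max_n: int = 15) -> set:
    """All character n-grams of text, as an ngram filter would emit them."""
    return {text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)}

# Index time: the whole field value is kept as one token, then ngrammed.
index = ngrams('abc def ghi')

# Query time: the substring lookup becomes an exact term match -- no scan
# over every stored value is needed, at the cost of a much larger index.
print('c def g' in index)  # True
```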

Really go back to your use-case to see if this is _really_ necessary
though. Often people think it is because that's the only way they've
been able to search at all in SQL and it can turn out that there are
other ways to solve it. IOW, this could be an XY problem.

Best,
Erick

On Mon, Jan 30, 2017 at 1:52 AM, Maciej Ł. PCSS  wrote:
> Hi All,
>
> What solution have you applied in your implementations?
>
> Regards
> Maciej
>
>
> W dniu 24.01.2017 o 14:10, Maciej Ł. PCSS pisze:
>>
>> Dear SOLR users,
>>
>> please point me to the right solution of my problem. I'm using SOLR to
>> implement a Google-like search in my application and this scenario is
>> working fine.
>>
>> However, in specific use-cases I need to filter documents that include a
>> specific substring in a given field. It's about an SQL-like query similar to
>> this:
>>
>> SELECT * FROM table WHERE someField LIKE '%c def g%'
>>
>> I expect to match documents having someField = 'abc def ghi'. That means I
>> expect to match parts of words.
>>
>> As I understand, SOLR, as a reverse index, works with tokens rather
>> than character strings and thereby looks for whole words (not substrings).
>>
>> Is there any solution for such an issue?
>>
>> Regards
>> Maciej Łabędzki
>
>


Re: SQL-like queries (with percent character) - matching an exact substring, with parts of words

2017-01-30 Thread Maciej Ł. PCSS

Hi All,

What solution have you applied in your implementations?

Regards
Maciej


W dniu 24.01.2017 o 14:10, Maciej Ł. PCSS pisze:

Dear SOLR users,

please point me to the right solution of my problem. I'm using SOLR to 
implement a Google-like search in my application and this scenario is 
working fine.


However, in specific use-cases I need to filter documents that include 
a specific substring in a given field. It's about an SQL-like query 
similar to this:


SELECT * FROM table WHERE someField LIKE '%c def g%'

I expect to match documents having someField = 'abc def ghi'. That
means I expect to match parts of words.

As I understand, SOLR, as a reverse index, works with tokens
rather than character strings and thereby looks for whole words (not
substrings).


Is there any solution for such an issue?

Regards
Maciej Łabędzki