Re: Tokenization and SAI query syntax

Atri Sharma Mon, 07 Aug 2023 10:55:07 -0700

Why not start with SQLish operators supported by many databases (LIKE and
CONTAINS)?


On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan <[email protected]>
wrote:

> I am also -1 on directly exposing lucene like syntax here. Besides being
> ugly, SAI is not lucene, I do not think we should start using lucene syntax
> for it, it will make people think they can do everything else lucene allows.
>
> On Aug 7, 2023, at 5:13 AM, Benedict <[email protected]> wrote:
>
> 
> I’m strongly opposed to :
>
> It is very dissimilar to our current operators. CQL is already not the
> prettiest language, but let’s not make it a total mish mash.
>
>
>
> On 7 Aug 2023, at 10:59, Mike Adamson <[email protected]> wrote:
>
> 
> I am also in agreement with 'column : token' in that 'I don't hate it' but
> I'd like to offer an alternative to this in 'column HAS token'. HAS is
> currently not a keyword that we use so wouldn't cause any brain conflicts.
>
> While I don't hate ':' I have a particular dislike of the lucene search
> syntax because of its terseness and lack of easy readability.
>
> Saying that, I'm happy to do with ':' if that is the decision.
>
> On Fri, 4 Aug 2023 at 00:23, Jon Haddad <[email protected]>
> wrote:
>
>> Assuming SAI is a superset of SASI, and we were to set up something so
>> that SASI indexes auto convert to SAI, this gives even more weight to my
>> point regarding how differing behavior for the same syntax can lead to
>> issues.  Imo the best case scenario results in the user not even noticing
>> their indexes have changed.
>>
>> An (maybe better?) alternative is to add a flag to the index
>> configuration for "compatibility mod", which might address the concerns
>> around using an equality operator when it actually is a partial match.
>>
>> For what it's worth, I'm in agreement that = should mean full equality
>> and not token match.
>>
>> On 2023/08/03 03:56:23 Caleb Rackliffe wrote:
>> > For what it's worth, I'd very much like to completely remove SASI from
>> the
>> > codebase for 6.0. The only remaining functionality gaps at the moment
>> are
>> > LIKE (prefix/suffix) queries and its limited tokenization
>> > capabilities, both of which already have SAI Phase 2 Jiras.
>> >
>> > On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <[email protected]>
>> > wrote:
>> >
>> > > SASI just uses “=“ for the tokenized equality matching, which is the
>> exact
>> > > thing this discussion is about changing/not liking.
>> > >
>> > > > On Aug 2, 2023, at 7:18 PM, J. D. Jordan <[email protected]
>> >
>> > > wrote:
>> > > >
>> > > > I do not think LIKE actually applies here. LIKE is used for prefix,
>> > > contains, or suffix searches in SASI depending on the index type.
>> > > >
>> > > > This is about exact matching of tokens.
>> > > >
>> > > >> On Aug 2, 2023, at 5:53 PM, Jon Haddad <[email protected]
>> >
>> > > wrote:
>> > > >>
>> > > >> Certain bits of functionality also already exist on the SASI side
>> of
>> > > things, but I'm not sure how much overlap there is.  Currently,
>> there's a
>> > > LIKE keyword that handles token matching, although it seems to have
>> some
>> > > differences from the feature set in SAI.
>> > > >>
>> > > >> That said, there seems to be enough of an overlap that it would
>> make
>> > > sense to consider using LIKE in the same manner, doesn't it?  I think
>> it
>> > > would be a little odd if we have different syntax for different
>> indexes.
>> > > >>
>> > > >> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>> > > >>
>> > > >> I think one complication here is that there seems to be a desire,
>> that
>> > > I very much agree with, to expose as much of the underlying
>> flexibility of
>> > > Lucene as much as possible.  If it means we use Caleb's suggestion,
>> I'd ask
>> > > that the queries that SASI and SAI both support use the same syntax,
>> even
>> > > if it means there's two ways of writing the same query.  To use
>> Caleb's
>> > > example, this would mean supporting both LIKE and the `expr` column.
>> > > >>
>> > > >> Jon
>> > > >>
>> > > >>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
>> > > >>> Here are some additional bits of prior art, if anyone finds them
>> > > useful:
>> > > >>>
>> > > >>>
>> > > >>> The Stratio Lucene Index -
>> > > >>> https://github.com/Stratio/cassandra-lucene-index#examples
>> > > >>>
>> > > >>> Stratio was the reason C* added the "expr" functionality. They
>> embedded
>> > > >>> something similar to ElasticSearch JSON, which probably isn't my
>> > > favorite
>> > > >>> choice, but it's there.
>> > > >>>
>> > > >>>
>> > > >>> The ElasticSearch match query syntax -
>> > > >>>
>> > >
>> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html__;!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAf0MsxZ9$
>> > > >>>
>> > > >>> Again, not my favorite. It's verbose, and probably too powerful
>> for us.
>> > > >>>
>> > > >>>
>> > > >>> ElasticSearch's documentation for the basic Lucene query syntax -
>> > > >>>
>> > >
>> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html*query-string-syntax__;Iw!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAXEPP1sK$
>> > > >>>
>> > > >>> One idea is to take the basic Lucene index, which it seems we
>> already
>> > > have
>> > > >>> some support for, and feed it to "expr". This is nice for two
>> reasons:
>> > > >>>
>> > > >>> 1.) People can just write Lucene queries if they already know how.
>> > > >>> 2.) No changes to the grammar.
>> > > >>>
>> > > >>> Lucene has distinct concepts of filtering and querying, and this
>> is
>> > > kind of
>> > > >>> the latter. I'm not sure how, for example, we would want "expr" to
>> > > interact
>> > > >>> w/ filters on other column indexes in vanilla CQL space...
>> > > >>>
>> > > >>>
>> > > >>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <
>> [email protected]>
>> > > wrote:
>> > > >>>>
>> > > >>>> `column CONTAINS term`. Contains is used by both Java and Python
>> for
>> > > >>>> substring searches, so at least some users will be surprised by
>> > > term-based
>> > > >>>> behavior.
>> > > >>>>
>> > > >>>> I wonder whether users are in their "programming language"
>> headspace
>> > > or in
>> > > >>>> their "querying a database" headspace when interacting with CQL?
>> i.e.
>> > > this
>> > > >>>> would only present confusion if we expected users to be thinking
>> in
>> > > the
>> > > >>>> idioms of their respective programming languages. If they're
>> thinking
>> > > in
>> > > >>>> terms of SQL, MATCHES would probably end up confusing them a bit
>> > > since it
>> > > >>>> doesn't match the general structure of the MATCH operator.
>> > > >>>>
>> > > >>>> That said, I also think CONTAINS loses something important that
>> you
>> > > allude
>> > > >>>> to here Jonathan:
>> > > >>>>
>> > > >>>> with corresponding query-time tokenization and analysis.  This
>> means
>> > > that
>> > > >>>> the query term is not always a substring of the original string!
>> > > Besides
>> > > >>>> obvious transformations like lowercasing, you have things like
>> > > >>>> PhoneticFilter available as well.
>> > > >>>>
>> > > >>>> So to me, neither MATCHES nor CONTAINS are particularly great
>> > > candidates.
>> > > >>>>
>> > > >>>> So +1 to the "I don't actually hate it" sentiment on:
>> > > >>>>
>> > > >>>> column : term`. Inspired by Lucene’s syntax
>> > > >>>>
>> > > >>>>
>> > > >>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>> > > >>>>
>> > > >>>>
>> > > >>>> I have a strong preference not to use the name of an SQL
>> operator,
>> > > since
>> > > >>>> it precludes us later providing the SQL standard operator to
>> users.
>> > > >>>>
>> > > >>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>> > > >>>>
>> > > >>>>
>> > > >>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <
>> [email protected]>
>> > > wrote:
>> > > >>>>
>> > > >>>> 
>> > > >>>> `column = term` is definitively problematic because it creates an
>> > > >>>> ambiguity when the queried column belongs to the primary key.
>> For some
>> > > >>>> queries we wouldn't know whether the user wants a primary key
>> query
>> > > using
>> > > >>>> regular equality or an index query using the analyzer.
>> > > >>>>
>> > > >>>> `term_matches(column, term)` seems quite clear and hard to
>> > > misinterpret,
>> > > >>>> but it's quite long to write and its implementation will be
>> > > challenging
>> > > >>>> since we would need a bunch of special casing around
>> SelectStatement
>> > > and
>> > > >>>> functions.
>> > > >>>>
>> > > >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they
>> seem
>> > > to
>> > > >>>> evoke different behaviours to what they would have.
>> > > >>>>
>> > > >>>> `column LIKE :term:` seems a bit redundant compared to just using
>> > > `column
>> > > >>>> : term`, and we are still introducing a new symbol.
>> > > >>>>
>> > > >>>> I think I like `column : term` the most, because it's brief, it's
>> > > similar
>> > > >>>> to the equivalent Lucene's syntax, and it doesn't seem to clash
>> with
>> > > other
>> > > >>>> different meanings that I can think of.
>> > > >>>>
>> > > >>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <[email protected]
>> >
>> > > wrote:
>> > > >>>>
>> > > >>>> Hi all,
>> > > >>>>
>> > > >>>> With phase 1 of SAI wrapping up, I’d like to start the ball
>> rolling on
>> > > >>>> aligning around phase 2 features.
>> > > >>>>
>> > > >>>> In particular, we need to nail down the syntax for doing
>> non-exact
>> > > string
>> > > >>>> matches.  We have a proof of concept that includes full Lucene
>> > > analyzer and
>> > > >>>> filter functionality – just the text transformation pieces, none
>> of
>> > > the
>> > > >>>> storage parts – which is the gold standard in this space.  For
>> > > example, the
>> > > >>>> StandardAnalyzer [1] lowercases all terms and removes stopwords
>> > > (common
>> > > >>>> words like “a”, “is”, “the” that are usually not useful to search
>> > > >>>> against).  Lucene also has classes that offer stemming, special
>> case
>> > > >>>> handling for email, and many languages besides English [2].
>> > > >>>>
>> > > >>>> What syntax should we use to express “rows whose analyzed tokens
>> match
>> > > >>>> this search term?”
>> > > >>>>
>> > > >>>> The syntax must be clear that we want to look for this term
>> within the
>> > > >>>> column data using the configured index with corresponding
>> query-time
>> > > >>>> tokenization and analysis.  This means that the query term is not
>> > > always a
>> > > >>>> substring of the original string!  Besides obvious
>> transformations
>> > > like
>> > > >>>> lowercasing, you have things like PhoneticFilter available as
>> well.
>> > > >>>>
>> > > >>>> Here are my thoughts on some of the options:
>> > > >>>>
>> > > >>>> `column = term`.  This is what the POC does today and it’s super
>> > > confusing
>> > > >>>> to overload = to mean something other than exact equality.  I am
>> not
>> > > a fan.
>> > > >>>>
>> > > >>>> `column LIKE term` or `column LIKE %term%`. The closest SQL
>> operator,
>> > > but
>> > > >>>> neither the wildcarded nor unwildcarded syntax matches the
>> semantics
>> > > of
>> > > >>>> term-based search.
>> > > >>>>
>> > > >>>> `column MATCHES term`. I rather like this one, although Mike
>> points
>> > > out
>> > > >>>> that “match” has a meaning in the context of regular expressions
>> that
>> > > could
>> > > >>>> cause confusion here.
>> > > >>>>
>> > > >>>> `column CONTAINS term`. Contains is used by both Java and Python
>> for
>> > > >>>> substring searches, so at least some users will be surprised by
>> > > term-based
>> > > >>>> behavior.
>> > > >>>>
>> > > >>>> `term_matches(column, term)`. Postgresql FTS makes you use
>> functions
>> > > like
>> > > >>>> this for everything.  It’s pretty clunky, and we would need to
>> make
>> > > the
>> > > >>>> amazingly hairy SelectStatement even hairier to handle “use a
>> function
>> > > >>>> result in a predicate” like this.
>> > > >>>>
>> > > >>>> `column : term`. Inspired by Lucene’s syntax.  I don’t actually
>> hate
>> > > it.
>> > > >>>>
>> > > >>>> `column LIKE :term:`. Stick with the LIKE operator but add a new
>> > > symbol to
>> > > >>>> indicate term matching.  Arguably more SQL-ish than a new bare
>> symbol
>> > > >>>> operator.
>> > > >>>>
>> > > >>>> [1]
>> > > >>>>
>> > >
>> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
>> > > >>>> [2]
>> https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>> > > >>>>
>> > > >>>> --
>> > > >>>> Jonathan Ellis
>> > > >>>> co-founder, http://www.datastax.com
>> > > >>>> @spyced
>> > > >>>>
>> > > >>>>
>> > > >>>>
>> > > >>>
>> > >
>> >
>>
>
>
> --
> [image: DataStax Logo Square] <https://www.datastax.com/> *Mike Adamson*
> Engineering
>
> +1 650 389 6000 <16503896000> | datastax.com <https://www.datastax.com/>
> Find DataStax Online: [image: LinkedIn Logo]
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=akx0E6l2bnTjOvA-YxtonbW0M4b6bNg4nRwmcHNDo4Q&e=>
>    [image: Facebook Logo]
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.facebook.com_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=ncMlB41-6hHuqx-EhnM83-KVtjMegQ9c2l2zDzHAxiU&e=>
>    [image: Twitter Logo] <https://twitter.com/DataStax>   [image: RSS
> Feed] <https://www.datastax.com/blog/rss.xml>   [image: Github Logo]
> <https://github.com/datastax>
>
>

-- 
Regards,

Atri
Apache Concerted

Re: Tokenization and SAI query syntax

Reply via email to