[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742064#comment-15742064
 ] 

Alex Petrov commented on CASSANDRA-12915:
-----------------------------------------

This was something I have initially suspected, this was also why I've asked 
whether the problem is related to {{LIKE}} 
[previously|https://issues.apache.org/jira/browse/CASSANDRA-12915?focusedCommentId=15725035&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15725035].
 
Certainly, {{LIKE}} queries are much less performant, although I'm afraid my 
benchmarking has shown bit smaller differences (although still obviously 
slower).

In order to do the "simple" query, SASI simply grabs the so called term from 
the index. Term search is relatively inexpensive, and
locating a single term is cheap. Now, when we're moving to the {{LIKE}} query, 
instead of fetching a single term, we're fetching 
_many terms_ using {{TermIterator}}. For each of this terms, {{TokenTree}} is 
going to be composed. Unfortunately, on that
level not that many optimisations are done. Most of them are performed on the 
{{TokenTree}} level. 

So of course the slowness of the fact that we're doing a whole lot of work is 
showing up when one index result is disproportionally 
larger than the other one, but you can see somewhat similar effects even when 
the token trees are of comparable sizes (although larger
than just a couple of items).

A couple of things to try / play with in order to improve your query 
performance: 

   * Take a closer look at tokenisers, might be you'll be better off by 
tokenising your data differently and performing EQ queries instead
of {{LIKE}} term ranges
   * Looks like one of the items in your index is disproportionally larger than 
the other (possibly even binary, or of a very low cardinality).
It might be possible to just do drop index and do filtering. If you really have 
to query the second column, use "native" 2i

To summarise: I agree there's an inefficiency that's shown by the query type 
you're proposing. I think we should solve this inefficiency
by changing the way {{searchRange}} / {{searchPoint}} is working in the 
{{OnDiskIndex}}, but this may be a very big change, plus we have 
to take into account how that plays with the fact that the index is SSTable 
attached. 

Having that said, whenever we have a proper query planner at hand on the 
top-level, we should definitely include rule-based optimisation
process, and this might become one of the rules. To fully solve it, we'll also 
have to have proper cardinality estimation (maybe with 
intersections etc) in place. 

> SASI: Index intersection can be very inefficient
> ------------------------------------------------
>
>                 Key: CASSANDRA-12915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: sasi
>            Reporter: Corentin Chary
>             Fix For: 3.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to