[
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742064#comment-15742064
]
Alex Petrov commented on CASSANDRA-12915:
-----------------------------------------
This was something I have initially suspected, this was also why I've asked
whether the problem is related to {{LIKE}}
[previously|https://issues.apache.org/jira/browse/CASSANDRA-12915?focusedCommentId=15725035&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15725035].
Certainly, {{LIKE}} queries are much less performant, although I'm afraid my
benchmarking has shown bit smaller differences (although still obviously
slower).
In order to do the "simple" query, SASI simply grabs the so called term from
the index. Term search is relatively inexpensive, and
locating a single term is cheap. Now, when we're moving to the {{LIKE}} query,
instead of fetching a single term, we're fetching
_many terms_ using {{TermIterator}}. For each of this terms, {{TokenTree}} is
going to be composed. Unfortunately, on that
level not that many optimisations are done. Most of them are performed on the
{{TokenTree}} level.
So of course the slowness of the fact that we're doing a whole lot of work is
showing up when one index result is disproportionally
larger than the other one, but you can see somewhat similar effects even when
the token trees are of comparable sizes (although larger
than just a couple of items).
A couple of things to try / play with in order to improve your query
performance:
* Take a closer look at tokenisers, might be you'll be better off by
tokenising your data differently and performing EQ queries instead
of {{LIKE}} term ranges
* Looks like one of the items in your index is disproportionally larger than
the other (possibly even binary, or of a very low cardinality).
It might be possible to just do drop index and do filtering. If you really have
to query the second column, use "native" 2i
To summarise: I agree there's an inefficiency that's shown by the query type
you're proposing. I think we should solve this inefficiency
by changing the way {{searchRange}} / {{searchPoint}} is working in the
{{OnDiskIndex}}, but this may be a very big change, plus we have
to take into account how that plays with the fact that the index is SSTable
attached.
Having that said, whenever we have a proper query planner at hand on the
top-level, we should definitely include rule-based optimisation
process, and this might become one of the rules. To fully solve it, we'll also
have to have proper cardinality estimation (maybe with
intersections etc) in place.
> SASI: Index intersection can be very inefficient
> ------------------------------------------------
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
> Issue Type: Improvement
> Components: sasi
> Reporter: Corentin Chary
> Fix For: 3.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the
> intersection (and effectively only use 'index1') the query will run in a few
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection
> when we know it will be slow. Probably when the range size factor is very
> small and the range size is big.
> * CASSANDRA-10765
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)