[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742552#comment-15742552
 ] 

Alex Petrov commented on CASSANDRA-12915:
-----------------------------------------

bq. But really it would be better if the query optimizer was in the database 
rather than in our code.

I'm sorry if it sounded like I'm opposed to the change. I'm just saying that 
the bottleneck is in the way Terms are iterated, so we may want to address this 
one first. Also, I'm sure this way does not bring in any regressions, but since 
if 
we improve Term iteration, we may bring the performance to the levels 
comparable with what happens in the {{TokenTree}}, which would mean that we 
don't have to ignore one of the indexes. In my opinion, join of two indexes 
is a manifestation and not a source of the problem, so I'd rather fix the cause.

bq. Do you see a quick way of fixing the 'empty iterator' issue ? 

Sure, I have a patch in my branch, which pretty much literally returns an empty 
iterator (one that'd return {{false}} for {{hasNext}} and {{endOfData}} for 
{{next}}. 

bq. Would you agree to chance the format to include proper cardinality 
estimation to the index (so we can build upon later)

It's not my decision, you can post the proposal to the Mailing List and 
everyone who's involved with SASI can contribute to the discussion and provide 
insightful feedback. At this point I'm not sure about 
a) how exactly to do that, since technically we have counts on token tree 
level, but it seems that with LIKE the problem is one level higher, on TERM 
level, so we have to know how many items would match a non-EQ query. I'm not 
aware
of any way to do this without a linear scan, tree or trie (e.g. estimate). My 
initial comment was related to EQ. 
b) if we really need to follow this path or understand how to optimise {{LIKE}} 
prefix queries instead. But this may be a very significant chunk of work. 

bq. What would be the ETA for a proper query planner ? If O(months), would it 
make sense to merge something like what I did to bring some performance 
improvement in the meantime ?

Unfortunately I do not have any estimates on that. In my personal opinion, 
Query Planner in SASI is secondary to the improvements to RangeIterators, 
#11990 (and all opportunities that it opens before us), possible optimisations 
for how LIKE queries are iterated
are going to both solve this particular problem and give much better results in 
"average case". 

Hope that helps. 

> SASI: Index intersection can be very inefficient
> ------------------------------------------------
>
>                 Key: CASSANDRA-12915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: sasi
>            Reporter: Corentin Chary
>             Fix For: 3.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to