[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905037#comment-15905037 ] Alex Petrov commented on CASSANDRA-12915: - Committed as [2c111d15bb080283b9b98d48fab4bcf4db515b5a|https://github.com/apache/cassandra/commit/2c111d15bb080283b9b98d48fab4bcf4db515b5a] to 3.11 and merged up to trunk. A small side note: this was my very first time as I have committed anyone else's code to the repository, and I have accidentally forgotten to include {{--amend}} with author name, so the patch got pushed under my credentials. Since we're a big project we can not really amend the history on primary branches and just force-push. I've included a proper attribution on the authorship in the patch comment, but the github history unfortunately would still hold my email. I'm really sorry about that, I will take an extra care next time. Thank you for understanding, > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903086#comment-15903086 ] Corentin Chary commented on CASSANDRA-12915: LGTM, Thanks for cleaning up, this is way better now > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902793#comment-15902793 ] Alex Petrov commented on CASSANDRA-12915: - CI looks pretty broken, but not because of this patch: |[patch|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:12915-alternative]|[utest|http://cassci.datastax.com/job/ifesdjeen-12915-alternative-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/job/ifesdjeen-12915-alternative-testall/lastCompletedBuild/testReport/] > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901335#comment-15901335 ] Corentin Chary commented on CASSANDRA-12915: Looks good now, would be nice to see the results of the CI on this version :) > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901028#comment-15901028 ] Alex Petrov commented on CASSANDRA-12915: - bq. I think it would be better not to make the assumption that null range == empty range. Mostly because it isn't treated the same way in add() Agree, fixed. bq. That's not how intersection usually works You're right. I was thinking to make it for both union and intersection cases. But since we use it only in the intersection case, we can flip the "emptiness" logic and still remove the check for empty. > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900977#comment-15900977 ] Corentin Chary commented on CASSANDRA-12915: {code} this(range == null ? null : range.min, range == null ? null : range.max, range == null ? 0 : range.count);{code} I think it would be better not to make the assumption that null range == empty range. Mostly because it isn't treated the same way in add() {code} If either range is empty. Empty range is a subrange of (overlaps with) any range.{code} That's not how intersection usually works, shouldn't the result of an empty range intersection with anything be an empty range ? (which means that an empty range overlaps with nothing) > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900948#comment-15900948 ] Alex Petrov commented on CASSANDRA-12915: - I've looked at the code once again and turns out that we can't rely on disjoint for determining whether to return an empty iterator or no, since in case with union we would like to return just the iterators that produce results (empty ones won't produce any anyways) and in case with intersection, even though empty is overlapping with every set, we should make a distinction, since intersection with an empty iterator is empty. I have missed this yesterday and my tests were passing only by chance (since intersections were disjoint by themselves anyways). I've addressed the issues [here|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:12915-alternative]. A couple of comments on motivation: * a bit more tests to make sure we cover more cases * one of the problems revealed by new tests was that the original patch was yielding a bounce intersection iterator (which actually has min/max), but with empty range. Now we consistently return empty iterator that doesn't have min and max set. * I wanted to avoid making a distinction for the first vs the rest ranges, mostly to use same code path * hopefully it became clearer when the empty iterator is going to be returned Could you take another look at the patch and see if we have common ground here? Thank you once again for clarifications and discussions: it's a complex problem, was hard to discover and isn't very simple to tackle from all sides simultaneously. > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898877#comment-15898877 ] Alex Petrov commented on CASSANDRA-12915: - Sure, what I'm saying is that behaviour is unnecessary to fix the problem that we have. SASI Range iterators are taking the shortest range. For example, if we have records like || a || b || c || | 1 | 1 | 1 | | 2 | 2 | 1 | | 3 | 3 | 1 | And {{b}} and {{c}} are SASI-indexed, and we want to run the query {{WHERE b = 2 AND c = 1}}, we will get 2 iterators, one with {{2}} (for the column {{b}}) and the second one with {{1,2,3}} (for the column {{c}}). Now, SASI will take the shortest range ({{b}}) and start iterating by "rewinding" the {{c}} iterator to the token {{2}}. If there's no item for the token, intersection will be empty. Now, moving closer to the problem. If we had just imbalanced iterators (one returning 100 results and the other just 1), SASI would be able to efficiently intersect ranges and hit the storage to retrieve just a single row. In the case with __empty__ results from one of the iterators, the one you're fixing, we were simply using a single iterator, and had to hit the storage a 100 times (assuming the other iterator returns 100 results). Now, with what I propose, we will hit is 0 times. Same with your approach, although with a lot of complex logic that I can not justify, so I have asked you to clarify. The trick was simply to "replace" that [null check|https://github.com/ifesdjeen/cassandra/commit/78b1ff630536b0f48787ced74a66d702d13637ba#diff-25d7f486e2818c56d6b01aa952d459f3L146] with an actual iterator, to let the intersection know there's a second part, and it's empty. I thought it was already discussed in [this comment|https://issues.apache.org/jira/browse/CASSANDRA-12915?focusedCommentId=15728412=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15728412] but I'm happy to re-iterate: bq. The way RangeIterator work and how it's optimised, we're [picking|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java#L234-L237] the "shortest" token tree and start skipping the second token tree based on the results retrieved from the first one. So one of the indexes will be iterated anyways. The second index (since it's larger) might have to fetch/load roughly 10 blocks (since they might be located far from one another on disk), but it never has to fetch all 2M items. It'll iterate only as many items as the smallest index has. For example, two index queries would skip through (left is which index is used, right is index of the item within the token tree): and bq. Moreover, RangeIterator already has [different optimisation strategies|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java#L76-L78] based on differences in cardinality. I'd say that current benchmark shows that the query is slightly slower (since we have to go through around twice as much data on disk). But given the numbers at hand the difference that is small and it's sub-millisecond, this optimisation seems not to pay the complexity that it brings. That's one of the ideas behind SASI, pretty much: to be able to efficiently merge, iterate and skip iterators. Looks like we're mostly on the same page. I've ran large-scale tests with a variant of the patch I've posted (160+GB of data), and it works exactly as expected. Now, I want to simplify the patch and make sure we do not add the code we do not need. So there's no need to add additional logic to hide empty ranges, it's already handled. > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898148#comment-15898148 ] Corentin Chary commented on CASSANDRA-12915: Could you re-phrase the question ? I though I answered everything from [this comment|https://issues.apache.org/jira/browse/CASSANDRA-12915?focusedCommentId=15897393=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15897393] but it looks like I didn't. The idea of my approach is that I'm looking for this behavior: {code} builder = RangeIntersectionIterator.builder(strategy); builder.add(new LongIterator(new long[] {})); builder.add(new LongIterator(new long[] {1})); range = builder.build(); Assert.assertEquals(0, range.getCount()); Assert.assertFalse(range.hasNext()); // (optimized though isOverlapping() returning false {code} In other words, adding an empty iterator to a RangeIntersectionIterator should make it empty and there is a strong different between an empty and null iterator. I believe in your case your empty iterator will just get ignored because you need to remove this check: https://github.com/ifesdjeen/cassandra/blob/78b1ff630536b0f48787ced74a66d702d13637ba/src/java/org/apache/cassandra/index/sasi/utils/RangeIterator.java#L151 > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897907#comment-15897907 ] Alex Petrov commented on CASSANDRA-12915: - This was a piece of test that remained from your code, the approach I suggest is slightly different, so I'm not surprised the empty range still counts as a range. My point is that we could have even taken out the null check from [here|https://github.com/ifesdjeen/cassandra/commit/78b1ff630536b0f48787ced74a66d702d13637ba#diff-25d7f486e2818c56d6b01aa952d459f3L146] and replaced it with an empty iterator and would still achieve the same result and get the issue fixed. If you disagree with the approach I'm proposing, you could answer my initial question and explain. I'm open, I just want to understand before we get it committed. > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897545#comment-15897545 ] Corentin Chary commented on CASSANDRA-12915: The fact that you didn't change the following line makes me thing that your patch doesn't really do what we need: Assert.assertEquals(1L, builder.add(new LongIterator(new long[] {})).rangeCount()); Empty ranges really should not get ignored, and the changes made in https://github.com/ifesdjeen/cassandra/commit/78b1ff630536b0f48787ced74a66d702d13637ba#diff-22e58be2cfd42af959cb63c97de7eb3cR246 show that the code do not behave like we would like it to. > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897529#comment-15897529 ] Alex Petrov commented on CASSANDRA-12915: - I've poked around a bit and could not see any behaviour change if we only introduce an empty iterator, like [here|https://github.com/ifesdjeen/cassandra/commit/78b1ff630536b0f48787ced74a66d702d13637ba] (this is by no means meant as a final version of the patch, only bringing it here as an example and a base for discussion). I'm just trying to understand the purpose of the rest of changes, it'd be good to hear your opinion. > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897430#comment-15897430 ] Corentin Chary commented on CASSANDRA-12915: * Removing ranges.isEmpty() happens in another function. Removing it doesn't change anything as forEach() will iterate on an empty list. * True for min() and max(). It's this way for the switch() because computing min / max keys with an empty range doesn't make much sense. Anything else ? If not I'll remove the duplicated code in min() and max() > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897393#comment-15897393 ] Alex Petrov commented on CASSANDRA-12915: - Thank you very much for the patch, I have a couple of questions: * [here|https://github.com/iksaif/cassandra/commit/4c8981a9b900a38dd67b3bb1560aaac7cc7ccfac#diff-88eaa3c77aa17a84006ad76f3151ed31L162], as far as I understand the code (and purpose), adding an empty range to current range doesn't change the range, why do you think we should remove {{ranges.isEmpty()}}? * when comparing range sizes, there's a check for emptiness which looks redundant, for example [here|https://github.com/iksaif/cassandra/commit/4c8981a9b900a38dd67b3bb1560aaac7cc7ccfac#diff-88eaa3c77aa17a84006ad76f3151ed31R294] and [here|https://github.com/iksaif/cassandra/commit/4c8981a9b900a38dd67b3bb1560aaac7cc7ccfac#diff-88eaa3c77aa17a84006ad76f3151ed31R301] also, [here|https://github.com/iksaif/cassandra/commit/4c8981a9b900a38dd67b3bb1560aaac7cc7ccfac#diff-88eaa3c77aa17a84006ad76f3151ed31R255]. In other words, there's nothing special about empty ranges. We can just return an empty range instead of null and that's pretty much it. > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: sasi >Reporter: Corentin Chary >Assignee: Corentin Chary > Fix For: 3.11.x, 4.x > > > It looks like RangeIntersectionIterator.java and be pretty inefficient in > some cases. Let's take the following query: > SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar'; > In this case: > * index1 = 'foo' will match 2 items > * index2 = 'bar' will match ~300k items > On my setup, the query will take ~1 sec, most of the time being spent in > disk.TokenTree.getTokenAt(). > if I patch RangeIntersectionIterator so that it doesn't try to do the > intersection (and effectively only use 'index1') the query will run in a few > tenth of milliseconds. > I see multiple solutions for that: > * Add a static thresold to avoid the use of the index for the intersection > when we know it will be slow. Probably when the range size factor is very > small and the range size is big. > * CASSANDRA-10765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient
[ https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871958#comment-15871958 ] Corentin Chary commented on CASSANDRA-12915: {code} CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true; CREATE TABLE test.test ( r text PRIMARY KEY, a text, b text, c text, data text ); CREATE CUSTOM INDEX test_a_idx ON test.test (a) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'true'}; CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'true'}; CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 'case_sensitive': 'true'}; {code} {code} $ cat > generate.py import sys import random def main(args): n = int(args[1]) for i in xrange(n): a = '0' b = i % 10 c = i % (n / 10) + random.randint(0, 10) print ("%d,%s,%d,%d,%d" % (i, a, b, c, i)) if __name__ == '__main__': main(sys.argv) $ python generate.py 200 > test.csv {code} {code} COPY test.test FROM 'test.csv' WITH MAXBATCHSIZE = 100 AND MAXATTEMPTS = 10 AND MAXINSERTERRORS = 99; {code} {code} cqlsh> SELECT * FROM test.test WHERE a = '1' AND c = '38151' LIMIT 1 ALLOW FILTERING; r | a | b | c | data ---+---+---+---+-- (0 rows) Tracing session: fbc23200-f522-11e6-95df-69d39475f5a8 activity | timestamp | source| source_elapsed | client ---++---++--- Execute CQL3 query | 2017-02-17 16:08:48.288000 | 127.0.0.1 | 0 | 127.0.0.1 Parsing SELECT * FROM test.test WHERE a = '1' AND c = '38151' LIMIT 1 ALLOW FILTERING; [Native-Transport-Requests-1] | 2017-02-17 16:08:48.288000 | 127.0.0.1 |268 | 127.0.0.1 Preparing statement [Native-Transport-Requests-1] | 2017-02-17 16:08:48.289000 | 127.0.0.1 |513 | 127.0.0.1 Index mean cardinalities are test_a_idx:-9223372036854775808,test_c_idx:-9223372036854775808. Scanning with test_a_idx. [Native-Transport-Requests-1] | 2017-02-17 16:08:48.289000 | 127.0.0.1 |913 | 127.0.0.1 Computing ranges to query [Native-Transport-Requests-1] | 2017-02-17 16:08:48.289000 | 127.0.0.1 | 1027 | 127.0.0.1 Submitting range requests on 257 ranges with a concurrency of 1 (-3.24259165E16 rows per range expected) [Native-Transport-Requests-1] | 2017-02-17 16:08:48.289001 | 127.0.0.1 | 1319 | 127.0.0.1 Submitted 1 concurrent range requests [Native-Transport-Requests-1] | 2017-02-17 16:08:48.29 | 127.0.0.1 | 2229 | 127.0.0.1 Executing read on test.test using index test_a_idx [ReadStage-3] | 2017-02-17 16:08:48.292000 | 127.0.0.1 | 3494 | 127.0.0.1 Read 0 live and 0 tombstone cells [ReadStage-3] | 2017-02-17 16:08:48.293000 | 127.0.0.1 | 4694 | 127.0.0.1 Request complete | 2017-02-17 16:08:48.292930 | 127.0.0.1 | 4930 | 127.0.0.1 {code} Yay ! No more iterating on the useless index. Patch is on https://github.com/iksaif/cassandra/tree/sasi-null-intersect > SASI: Index intersection with an empty range really inefficient > --- > > Key: CASSANDRA-12915 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12915 > Project: Cassandra > Issue Type: Improvement > Components: