[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-10 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905037#comment-15905037
 ] 

Alex Petrov commented on CASSANDRA-12915:
-

Committed as 
[2c111d15bb080283b9b98d48fab4bcf4db515b5a|https://github.com/apache/cassandra/commit/2c111d15bb080283b9b98d48fab4bcf4db515b5a]
 to 3.11 and merged up to trunk.

A small side note: this was my very first time as I have committed anyone 
else's code to the repository, and I have accidentally forgotten to include 
{{--amend}} with author name, so the patch got pushed under my credentials. 
Since we're a big project we can not really amend the history on primary 
branches and just force-push. I've included a proper attribution on the 
authorship in the patch comment, but the github history unfortunately would 
still hold my email. I'm really sorry about that, I will take an extra care 
next time. Thank you for understanding,

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-09 Thread Corentin Chary (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903086#comment-15903086
 ] 

Corentin Chary commented on CASSANDRA-12915:


LGTM, Thanks for cleaning up, this is way better now

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-09 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902793#comment-15902793
 ] 

Alex Petrov commented on CASSANDRA-12915:
-

CI looks pretty broken, but not because of this patch: 

|[patch|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:12915-alternative]|[utest|http://cassci.datastax.com/job/ifesdjeen-12915-alternative-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/job/ifesdjeen-12915-alternative-testall/lastCompletedBuild/testReport/]



> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-08 Thread Corentin Chary (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901335#comment-15901335
 ] 

Corentin Chary commented on CASSANDRA-12915:


Looks good now, would be nice to see the results of the CI on this version :)

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-08 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901028#comment-15901028
 ] 

Alex Petrov commented on CASSANDRA-12915:
-

bq. I think it would be better not to make the assumption that null range == 
empty range. Mostly because it isn't treated the same way in add()

Agree, fixed.

bq. That's not how intersection usually works

You're right. I was thinking to make it for both union and intersection cases. 
But since we use it only in the intersection case, we can flip the "emptiness" 
logic and still remove the check for empty. 

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-08 Thread Corentin Chary (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900977#comment-15900977
 ] 

Corentin Chary commented on CASSANDRA-12915:


{code} this(range == null ? null : range.min, range == null ? null : range.max, 
range == null ? 0 : range.count);{code}

I think it would be better not to make the assumption that null range == empty 
range. Mostly because it isn't treated the same way in add()

{code} If either range is empty. Empty range is a subrange of (overlaps with) 
any range.{code}

That's not how intersection usually works, shouldn't the result of an empty 
range intersection with anything be an empty range ? (which means that an empty 
range overlaps with nothing)


> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-08 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15900948#comment-15900948
 ] 

Alex Petrov commented on CASSANDRA-12915:
-

I've looked at the code once again and turns out that we can't rely on disjoint 
for determining whether to return an empty iterator or no, since in case with 
union we would like to return just the iterators that produce results (empty 
ones won't produce any anyways) and in case with intersection, even though 
empty is overlapping with every set, we should make a distinction, since 
intersection with an empty iterator is empty. I have missed this yesterday and 
my tests were passing only by chance (since intersections were disjoint by 
themselves anyways).

I've addressed the issues 
[here|https://github.com/apache/cassandra/compare/trunk...ifesdjeen:12915-alternative].

A couple of comments on motivation:
  * a bit more tests to make sure we cover more cases
  * one of the problems revealed by new tests was that the original patch was 
yielding a bounce intersection iterator (which actually has min/max), but with 
empty range. Now we consistently return empty iterator that doesn't have min 
and max set. 
  * I wanted to avoid making a distinction for the first vs the rest ranges, 
mostly to use same code path
  * hopefully it became clearer when the empty iterator is going to be returned

Could you take another look at the patch and see if we have common ground here?
Thank you once again for clarifications and discussions: it's a complex 
problem, was hard to discover and isn't very simple to tackle from all sides 
simultaneously.

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-06 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898877#comment-15898877
 ] 

Alex Petrov commented on CASSANDRA-12915:
-

Sure, what I'm saying is that behaviour is unnecessary to fix the problem that 
we have. SASI Range iterators are taking the shortest range.

For example, if we have records like 

|| a || b || c ||
| 1 | 1 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |

And {{b}} and {{c}} are SASI-indexed, and we want to run the query {{WHERE b = 
2 AND c = 1}}, we will get 2 iterators, one with {{2}} (for the column {{b}}) 
and the second one with {{1,2,3}} (for the column {{c}}). Now, SASI will take 
the shortest range ({{b}}) and start iterating by "rewinding" the {{c}} 
iterator to the token {{2}}. If there's no item for the token, intersection 
will be empty. 

Now, moving closer to the problem. If we had just imbalanced iterators (one 
returning 100 results and the other just 1), SASI would be able to efficiently 
intersect ranges and hit the storage to retrieve just a single row. In the case 
with __empty__ results from one of the iterators, the one you're fixing, we 
were simply using a single iterator, and had to hit the storage a 100 times 
(assuming the other iterator returns 100 results). Now, with what I propose, we 
will hit is 0 times. Same with your approach, although with a lot of complex 
logic that I can not justify, so I have asked you to clarify. The trick was 
simply to "replace" that [null 
check|https://github.com/ifesdjeen/cassandra/commit/78b1ff630536b0f48787ced74a66d702d13637ba#diff-25d7f486e2818c56d6b01aa952d459f3L146]
 with an actual iterator, to let the intersection know there's a second part, 
and it's empty. I thought it was already discussed in [this 
comment|https://issues.apache.org/jira/browse/CASSANDRA-12915?focusedCommentId=15728412=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15728412]
 but I'm happy to re-iterate:

bq. The way RangeIterator work and how it's optimised, we're 
[picking|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java#L234-L237]
 the "shortest" token tree and start skipping the second token tree based on 
the results retrieved from the first one. So one of the indexes will be 
iterated anyways. The second index (since it's larger) might  have to 
fetch/load roughly 10 blocks (since they might be located far from one another 
on disk), but it never has to fetch all 2M items. It'll iterate only as many 
items as the smallest index has. For example, two index queries would skip 
through (left is which index is used, right is index of the item within the 
token tree):

and 

bq. Moreover, RangeIterator already has [different optimisation 
strategies|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/index/sasi/utils/RangeIntersectionIterator.java#L76-L78]
 based on differences in cardinality. I'd say that current benchmark shows that 
the query is slightly slower (since we have to go through around twice as much 
data on disk). But given the numbers at hand  the difference that is small and 
it's sub-millisecond, this optimisation seems not to pay the complexity that it 
brings.

That's one of the ideas behind SASI, pretty much: to be able to efficiently 
merge, iterate and skip iterators.

Looks like we're mostly on the same page. I've ran large-scale tests with a 
variant of the patch I've posted (160+GB of data), and it works exactly as 
expected.  Now, I want to simplify the patch and make sure we do not add the 
code we do not need.  

So there's no need to add additional logic to hide empty ranges, it's already 
handled. 

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size 

[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-06 Thread Corentin Chary (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898148#comment-15898148
 ] 

Corentin Chary commented on CASSANDRA-12915:


Could you re-phrase the question ? I though I answered everything from [this 
comment|https://issues.apache.org/jira/browse/CASSANDRA-12915?focusedCommentId=15897393=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15897393]
 but it looks like I didn't.

The idea of my approach is that I'm looking for this behavior:

{code}
builder = RangeIntersectionIterator.builder(strategy);
builder.add(new LongIterator(new long[] {}));
builder.add(new LongIterator(new long[] {1}));
range = builder.build();

Assert.assertEquals(0, range.getCount());
Assert.assertFalse(range.hasNext()); // (optimized though isOverlapping() 
returning false
{code}

In other words, adding an empty iterator to a RangeIntersectionIterator should 
make it empty and there is a strong different between an empty and null 
iterator. I believe in your case your empty iterator will just get ignored 
because you need to remove this check: 
https://github.com/ifesdjeen/cassandra/blob/78b1ff630536b0f48787ced74a66d702d13637ba/src/java/org/apache/cassandra/index/sasi/utils/RangeIterator.java#L151

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-06 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897907#comment-15897907
 ] 

Alex Petrov commented on CASSANDRA-12915:
-

This was a piece of test that remained from your code, the approach I suggest 
is slightly different, so I'm not surprised the empty range still counts as a 
range. 

My point is that we could have even taken out the null check from 
[here|https://github.com/ifesdjeen/cassandra/commit/78b1ff630536b0f48787ced74a66d702d13637ba#diff-25d7f486e2818c56d6b01aa952d459f3L146]
 and replaced it with an empty iterator and would still achieve the same result 
and get the issue fixed. 

If you disagree with the approach I'm proposing, you could answer my initial 
question and explain. I'm open, I just want to understand before we get it 
committed.

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-06 Thread Corentin Chary (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897545#comment-15897545
 ] 

Corentin Chary commented on CASSANDRA-12915:


The fact that you didn't change the following line makes me thing that your 
patch doesn't really do what we need:
Assert.assertEquals(1L, builder.add(new LongIterator(new long[] 
{})).rangeCount());

Empty ranges really should not get ignored, and the changes made in 
https://github.com/ifesdjeen/cassandra/commit/78b1ff630536b0f48787ced74a66d702d13637ba#diff-22e58be2cfd42af959cb63c97de7eb3cR246
 show that the code do not behave like we would like it to.

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-06 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897529#comment-15897529
 ] 

Alex Petrov commented on CASSANDRA-12915:
-

I've poked around a bit and could not see any behaviour change if we only 
introduce an empty iterator, like 
[here|https://github.com/ifesdjeen/cassandra/commit/78b1ff630536b0f48787ced74a66d702d13637ba]
 (this is by no means meant as a final version of the patch, only bringing it 
here as an example and a base for discussion). I'm just trying to understand 
the purpose of the rest of changes, it'd be good to hear your opinion.

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-06 Thread Corentin Chary (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897430#comment-15897430
 ] 

Corentin Chary commented on CASSANDRA-12915:


* Removing ranges.isEmpty() happens in another function. Removing it doesn't 
change anything as forEach() will iterate on an empty list.
* True for min() and max(). It's this way for the switch() because computing 
min / max keys with an empty range doesn't make much sense.

Anything else ? If not I'll remove the duplicated code in min() and max()

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-03-06 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897393#comment-15897393
 ] 

Alex Petrov commented on CASSANDRA-12915:
-

Thank you very much for the patch, I have a couple of questions:

  * 
[here|https://github.com/iksaif/cassandra/commit/4c8981a9b900a38dd67b3bb1560aaac7cc7ccfac#diff-88eaa3c77aa17a84006ad76f3151ed31L162],
 as far as I understand the code (and purpose), adding an empty range to 
current range doesn't change the range, why do you think we should remove 
{{ranges.isEmpty()}}? 
  * when comparing range sizes, there's a check for emptiness which looks 
redundant, for example 
[here|https://github.com/iksaif/cassandra/commit/4c8981a9b900a38dd67b3bb1560aaac7cc7ccfac#diff-88eaa3c77aa17a84006ad76f3151ed31R294]
 and 
[here|https://github.com/iksaif/cassandra/commit/4c8981a9b900a38dd67b3bb1560aaac7cc7ccfac#diff-88eaa3c77aa17a84006ad76f3151ed31R301]
 also, 
[here|https://github.com/iksaif/cassandra/commit/4c8981a9b900a38dd67b3bb1560aaac7cc7ccfac#diff-88eaa3c77aa17a84006ad76f3151ed31R255].
 

In other words, there's nothing special about empty ranges. We can just return 
an empty range instead of null and that's pretty much it.

> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: sasi
>Reporter: Corentin Chary
>Assignee: Corentin Chary
> Fix For: 3.11.x, 4.x
>
>
> It looks like RangeIntersectionIterator.java and be pretty inefficient in 
> some cases. Let's take the following query:
> SELECT data FROM table WHERE index1 = 'foo' AND index2 = 'bar';
> In this case:
> * index1 = 'foo' will match 2 items
> * index2 = 'bar' will match ~300k items
> On my setup, the query will take ~1 sec, most of the time being spent in 
> disk.TokenTree.getTokenAt().
> if I patch RangeIntersectionIterator so that it doesn't try to do the 
> intersection (and effectively only use 'index1') the query will run in a few 
> tenth of milliseconds.
> I see multiple solutions for that:
> * Add a static thresold to avoid the use of the index for the intersection 
> when we know it will be slow. Probably when the range size factor is very 
> small and the range size is big.
> * CASSANDRA-10765



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (CASSANDRA-12915) SASI: Index intersection with an empty range really inefficient

2017-02-17 Thread Corentin Chary (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871958#comment-15871958
 ] 

Corentin Chary commented on CASSANDRA-12915:


{code}
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE test.test (
r text PRIMARY KEY,
a text,
b text,
c text,
data text
);

CREATE CUSTOM INDEX test_a_idx ON test.test (a) USING 
'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 
'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
'case_sensitive': 'true'};
CREATE CUSTOM INDEX test_c_idx ON test.test (c) USING 
'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 
'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
'case_sensitive': 'true'};
CREATE CUSTOM INDEX test_b_idx ON test.test (b) USING 
'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'analyzer_class': 
'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', 
'case_sensitive': 'true'};
{code}

{code}
$ cat > generate.py
import sys
import random

def main(args):
n = int(args[1])

for i in xrange(n):
a = '0'
b = i % 10
c = i % (n / 10) + random.randint(0, 10)
print ("%d,%s,%d,%d,%d" % (i, a, b, c, i))

if __name__ == '__main__':
main(sys.argv)
$ python generate.py 200 > test.csv
{code}
{code}
COPY test.test FROM 'test.csv'  WITH MAXBATCHSIZE = 100 AND MAXATTEMPTS = 10 
AND MAXINSERTERRORS = 99;
{code}

{code}
cqlsh> SELECT * FROM test.test WHERE a = '1' AND c = '38151' LIMIT 1 ALLOW 
FILTERING;

 r | a | b | c | data
---+---+---+---+--

(0 rows)

Tracing session: fbc23200-f522-11e6-95df-69d39475f5a8

 activity   
   | 
timestamp  | source| source_elapsed | client
---++---++---

Execute CQL3 query | 
2017-02-17 16:08:48.288000 | 127.0.0.1 |  0 | 127.0.0.1
  Parsing SELECT * FROM test.test WHERE a = '1' 
AND c = '38151' LIMIT 1 ALLOW FILTERING; [Native-Transport-Requests-1] | 
2017-02-17 16:08:48.288000 | 127.0.0.1 |268 | 127.0.0.1

 Preparing statement [Native-Transport-Requests-1] | 
2017-02-17 16:08:48.289000 | 127.0.0.1 |513 | 127.0.0.1
 Index mean cardinalities are 
test_a_idx:-9223372036854775808,test_c_idx:-9223372036854775808. Scanning with 
test_a_idx. [Native-Transport-Requests-1] | 2017-02-17 16:08:48.289000 | 
127.0.0.1 |913 | 127.0.0.1

   Computing ranges to query [Native-Transport-Requests-1] | 
2017-02-17 16:08:48.289000 | 127.0.0.1 |   1027 | 127.0.0.1
Submitting range requests on 257 ranges with a concurrency of 1 
(-3.24259165E16 rows per range expected) [Native-Transport-Requests-1] | 
2017-02-17 16:08:48.289001 | 127.0.0.1 |   1319 | 127.0.0.1

   Submitted 1 concurrent range requests [Native-Transport-Requests-1] | 
2017-02-17 16:08:48.29 | 127.0.0.1 |   2229 | 127.0.0.1

  Executing read on test.test using index test_a_idx [ReadStage-3] | 
2017-02-17 16:08:48.292000 | 127.0.0.1 |   3494 | 127.0.0.1

   Read 0 live and 0 tombstone cells [ReadStage-3] | 
2017-02-17 16:08:48.293000 | 127.0.0.1 |   4694 | 127.0.0.1

  Request complete | 
2017-02-17 16:08:48.292930 | 127.0.0.1 |   4930 | 127.0.0.1
{code}


Yay ! No more iterating on the useless index.

Patch is on https://github.com/iksaif/cassandra/tree/sasi-null-intersect


> SASI: Index intersection with an empty range really inefficient
> ---
>
> Key: CASSANDRA-12915
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12915
> Project: Cassandra
>  Issue Type: Improvement
>  Components: