[
https://issues.apache.org/jira/browse/CASSANDRA-8131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Catalin Alexandru Zamfir updated CASSANDRA-8131:
------------------------------------------------
Description:
After watching Jonathan's 2014 summit video, I wanted to give collection
indexes a try as they seem to be a fit for a "search by key/values" usage
pattern we have in our setup. Doing some test queries that I expect users would
do against the table, a short-circuit behavior came up:
Here's the whole transcript:
{noformat}
CREATE TABLE by_sets (id int PRIMARY KEY, datakeys set<text>, datavars
set<text>);
CREATE INDEX by_sets_datakeys ON by_sets (datakeys);
CREATE INDEX by_sets_datavars ON by_sets (datavars);
INSERT INTO by_sets (id, datakeys, datavars) VALUES (1, {'a'}, {'b'});
INSERT INTO by_sets (id, datakeys, datavars) VALUES (2, {'c'}, {'d'});
INSERT INTO by_sets (id, datakeys, datavars) VALUES (3, {'e'}, {'f'});
INSERT INTO by_sets (id, datakeys, datavars) VALUES (4, {'a'}, {'z'});
SELECT * FROM by_sets;
####
id | datakeys | datavars
----+----------+----------
1 | {'a'} | {'b'}
2 | {'c'} | {'d'}
4 | {'a'} | {'z'}
3 | {'e'} | {'f'}
####
{noformat}
We then tried this query which short-circuited:
{noformat}
SELECT * FROM by_sets WHERE datakeys CONTAINS 'a' AND datakeys CONTAINS 'c';
####
id | datakeys | datavars
----+----------+----------
1 | {'a'} | {'b'}
4 | {'a'} | {'z'}
(2 rows)
####
{noformat}
Instead of receveing 3 rows, which match the datakeys CONTAINS 'a' AND datakeys
CONTAINS 'c' we only got the first.
Doing the same, but with CONTAINS 'c' first, ignores the second AND.
{noformat}
#> SELECT * FROM by_sets WHERE datakeys CONTAINS 'c' AND datakeys CONTAINS 'a' ;
####
id | datakeys | datavars
----+----------+----------
2 | {'c'} | {'d'}
(1 rows)
####
{noformat}
Also, on a side-note, I have two indexes on both datakeys and datavars. But
when trying to run a query such as:
{noformat}
#> select * from by_sets WHERE datakeys CONTAINS 'a' AND datavars CONTAINS 'z';
code=2200 [Invalid query] message="Cannot execute this query as it might
involve data filtering and thus may have unpredictable performance.
If you want to execute this query despite the performance unpredictability, use
ALLOW FILTERING"
{noformat}
The second column, after AND (even if I inverse the order) requires an "allow
filtering" clause yet the column is indexed an an in-memory "join" of the
primary keys of these sets on the coordinator could build up the result.
Could anyone explain the short-circuit behavior?
And the requirement for "allow-filtering" on a secondly indexed column?
If they're not bugs but intended they should be documented better, at least
their limitations.
was:
After watching Jonathan's 2014 summit video, I wanted to give collection
indexes a try as they seem to be a fit for a "search by key/values" usage
pattern we have in our setup. Doing some test queries that I expect users would
do against the table, a short-circuit behavior came up:
Here's the whole transcript:
{noformat}
create table by_sets (id int PRIMARY KEY, datakeys set<text>, datavars
set<text>);
CREATE INDEX by_sets_datakeys ON by_sets (datakeys);
CREATE INDEX by_sets_datavars ON by_sets (datavars);
insert into by_sets (id, datakeys, datavars) values (1, {'a'}, {'b'});
insert into by_sets (id, datakeys, datavars) values (2, {'c'}, {'d'});
insert into by_sets (id, datakeys, datavars) values (3, {'e'}, {'f'});
insert into by_sets (id, datakeys, datavars) values (4, {'a'}, {'z'});
select * from by_sets;
####
id | datakeys | datavars
----+----------+----------
1 | {'a'} | {'b'}
2 | {'c'} | {'d'}
4 | {'a'} | {'z'}
3 | {'e'} | {'f'}
####
{noformat}
We then tried this query which short-circuited:
{noformat}
select * from by_sets WHERE datakeys cONTAINS 'a' AND datakeys CONTAINS 'c';
####
id | datakeys | datavars
----+----------+----------
1 | {'a'} | {'b'}
4 | {'a'} | {'z'}
(2 rows)
####
{noformat}
Instead of receveing 3 rows, which match the datakeys CONTAINS 'a' AND datakeys
CONTAINS 'c' we only got the first.
Doing the same, but with CONTAINS 'c' first, ignores the second AND.
{noformat}
#> select * from by_sets WHERE datakeys cONTAINS 'c' AND datakeys CONTAINS 'a' ;
####
id | datakeys | datavars
----+----------+----------
2 | {'c'} | {'d'}
(1 rows)
####
{noformat}
Also, on a side-note, I have two indexes on both datakeys and datavars. But
when trying to run a query such as:
{noformat}
#> select * from by_sets WHERE datakeys CONTAINS 'a' AND datavars CONTAINS 'z';
code=2200 [Invalid query] message="Cannot execute this query as it might
involve data filtering and thus may have unpredictable performance. If you want
to execute this query despite the performance unpredictability, use ALLOW
FILTERING"
{noformat}
The second column, after AND (even if I inverse the order) requires an "allow
filtering" clause yet the column is indexed an an in-memory "join" of the
primary keys of these sets on the coordinator could build up the result.
Could anyone explain the short-circuit behavior?
And the requirement for "allow-filtering" on a secondly indexed column?
If they're not bugs but intended they should be documented better, at least
their limitations.
> Short-circuited query results from collection index query
> ---------------------------------------------------------
>
> Key: CASSANDRA-8131
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8131
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Debian Wheezy, Oracle JDK, Cassandra 2.1
> Reporter: Catalin Alexandru Zamfir
> Labels: collections, cql3, cqlsh, query, queryparser
> Fix For: 2.1.0
>
>
> After watching Jonathan's 2014 summit video, I wanted to give collection
> indexes a try as they seem to be a fit for a "search by key/values" usage
> pattern we have in our setup. Doing some test queries that I expect users
> would do against the table, a short-circuit behavior came up:
> Here's the whole transcript:
> {noformat}
> CREATE TABLE by_sets (id int PRIMARY KEY, datakeys set<text>, datavars
> set<text>);
> CREATE INDEX by_sets_datakeys ON by_sets (datakeys);
> CREATE INDEX by_sets_datavars ON by_sets (datavars);
> INSERT INTO by_sets (id, datakeys, datavars) VALUES (1, {'a'}, {'b'});
> INSERT INTO by_sets (id, datakeys, datavars) VALUES (2, {'c'}, {'d'});
> INSERT INTO by_sets (id, datakeys, datavars) VALUES (3, {'e'}, {'f'});
> INSERT INTO by_sets (id, datakeys, datavars) VALUES (4, {'a'}, {'z'});
> SELECT * FROM by_sets;
> ####
> id | datakeys | datavars
> ----+----------+----------
> 1 | {'a'} | {'b'}
> 2 | {'c'} | {'d'}
> 4 | {'a'} | {'z'}
> 3 | {'e'} | {'f'}
> ####
> {noformat}
> We then tried this query which short-circuited:
> {noformat}
> SELECT * FROM by_sets WHERE datakeys CONTAINS 'a' AND datakeys CONTAINS 'c';
> ####
> id | datakeys | datavars
> ----+----------+----------
> 1 | {'a'} | {'b'}
> 4 | {'a'} | {'z'}
> (2 rows)
> ####
> {noformat}
> Instead of receveing 3 rows, which match the datakeys CONTAINS 'a' AND
> datakeys CONTAINS 'c' we only got the first.
> Doing the same, but with CONTAINS 'c' first, ignores the second AND.
> {noformat}
> #> SELECT * FROM by_sets WHERE datakeys CONTAINS 'c' AND datakeys CONTAINS
> 'a' ;
> ####
> id | datakeys | datavars
> ----+----------+----------
> 2 | {'c'} | {'d'}
> (1 rows)
> ####
> {noformat}
> Also, on a side-note, I have two indexes on both datakeys and datavars. But
> when trying to run a query such as:
> {noformat}
> #> select * from by_sets WHERE datakeys CONTAINS 'a' AND datavars CONTAINS
> 'z';
> code=2200 [Invalid query] message="Cannot execute this query as it might
> involve data filtering and thus may have unpredictable performance.
> If you want to execute this query despite the performance unpredictability,
> use ALLOW FILTERING"
> {noformat}
> The second column, after AND (even if I inverse the order) requires an "allow
> filtering" clause yet the column is indexed an an in-memory "join" of the
> primary keys of these sets on the coordinator could build up the result.
> Could anyone explain the short-circuit behavior?
> And the requirement for "allow-filtering" on a secondly indexed column?
> If they're not bugs but intended they should be documented better, at least
> their limitations.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)