[jira] [Updated] (CASSANDRA-13857) Allow MV with same PK but a different filtering

Kurt Greaves (JIRA) Thu, 07 Sep 2017 20:35:24 -0700

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Kurt Greaves updated CASSANDRA-13857:
-------------------------------------
    Description: 
We currently disallow creation of a view that has the exact same primary key as 
the base where no clustering keys are present, however a potential use case 
would be a view where part of the PK is filtered so as to have a subset of data 
in the view which is faster for range queries. We actually currently allow 
this, but only if you have a clustering key defined. If you only have a 
partitioning key it's not possible.

>From the mailing list, the below example works:
{code:java}
CREATE TABLE users (
  site_id int,
  user_id text,
  n int,
  data set<frozen<text>>,
  PRIMARY KEY ((site_id, user_id), n));

user data is updated and read by PK and sometimes I have to fetch all user for 
some specific site_id. It appeared that full scan by token(site_id,user_id) 
filtered by WHERE site_id = <some id> works much slower than unfiltered full 
scan on
CREATE MATERIALIZED VIEW users_1 AS
SELECT site_id, user_id, n, data
FROM users
WHERE site_id = 1 AND user_id IS NOT NULL AND n IS NOT NULL
PRIMARY KEY ((site_id, user_id), n);
{code}

However the following does not:
{code:java}
CREATE TABLE users (
site_id int,
user_id text,
data set<text>,
PRIMARY KEY ((site_id, user_id)));

CREATE MATERIALIZED VIEW users_1 AS
SELECT site_id, user_id, data
FROM users
WHERE site_id = 1 AND user_id IS NOT NULL 
PRIMARY KEY ((site_id, user_id));
InvalidRequest: Error from server: code=2200 [Invalid query] message="No 
columns are defined for Materialized View other than primary key"
{code}

This is because if the clustering key is empty we assume they've only defined 
the primary key in the partition key and we haven't accounted for this use 
case. 

On that note, we also don't allow the following narrowing of the partition key:
{code}
CREATE TABLE kurt.base (
    id int,
    uid text,
    data text,
    PRIMARY KEY (id, uid)
) 

CREATE MATERIALIZED VIEW kurt.mv2 AS SELECT * from kurt.base where id IS NOT 
NULL and uid='1' PRIMARY KEY ((id, uid));
{code}
But we do allow the following, which works because there is still a clustering 
key, despite not changing the PK.
{code}
CREATE MATERIALIZED VIEW kurt.mv2 AS SELECT * from kurt.base where id IS NOT 
NULL and uid='1' PRIMARY KEY (id, uid);
{code}
And we also allow the following, which is a narrowing of the partition key as 
above
{code}
create table kurt.base3 (id int, uid int, clus1 int, clus2 int, data text, 
PRIMARY KEY ((id, uid), clus1, clus2));

CREATE MATERIALIZED VIEW kurt.mv4 AS SELECT * from kurt.base3 where id IS NOT 
NULL and uid IS NOT NULL and clus1 IS NOT NULL AND clus2 IS NOT NULL  PRIMARY 
KEY ((id, uid, clus1), clus2);
{code}

I _think_ supporting these cases is trivial and mostly already handled in the 
underlying MV write path, so we might be able to get away with just a simple 
change of [this 
condition|https://github.com/apache/cassandra/blob/83822d12d87dcb3aaad2b1e670e57ebef4ab1c36/src/java/org/apache/cassandra/cql3/statements/CreateViewStatement.java#L291].


  was:
We currently disallow creation of a view that has the exact same primary key as 
the base where no clustering keys are present, however a potential use case 
would be a view where part of the PK is filtered so as to have a subset of data 
in the view which is faster for range queries. We actually currently allow 
this, but only if you have a clustering key defined. If you only have a 
partitioning key it's not possible.

>From the mailing list, the below example works:
{code:java}
CREATE TABLE users (
  site_id int,
  user_id text,
  n int,
  data set<frozen<text>>,
  PRIMARY KEY ((site_id, user_id), n));

user data is updated and read by PK and sometimes I have to fetch all user for 
some specific site_id. It appeared that full scan by token(site_id,user_id) 
filtered by WHERE site_id = <some id> works much slower than unfiltered full 
scan on
CREATE MATERIALIZED VIEW users_1 AS
SELECT site_id, user_id, n, data
FROM users
WHERE site_id = 1 AND user_id IS NOT NULL AND n IS NOT NULL
PRIMARY KEY ((site_id, user_id), n);
{code}

However the following does not:
{code:java}
CREATE TABLE users (
site_id int,
user_id text,
data set<text>,
PRIMARY KEY ((site_id, user_id)));

CREATE MATERIALIZED VIEW users_1 AS
SELECT site_id, user_id, data
FROM users
WHERE site_id = 1 AND user_id IS NOT NULL 
PRIMARY KEY ((site_id, user_id));
{code}

This is because if the clustering key is empty we assume they've only defined 
the primary key in the partition key and we haven't accounted for this use 
case. 

On that note, we also don't allow the following narrowing of the partition key:
{code}
CREATE TABLE kurt.base (
    id int,
    uid text,
    data text,
    PRIMARY KEY (id, uid)
) 

CREATE MATERIALIZED VIEW kurt.mv2 AS SELECT * from kurt.base where id IS NOT 
NULL and uid='1' PRIMARY KEY ((id, uid));
{code}
But we do allow the following, which works because there is still a clustering 
key, despite not changing the PK.
{code}
CREATE MATERIALIZED VIEW kurt.mv2 AS SELECT * from kurt.base where id IS NOT 
NULL and uid='1' PRIMARY KEY (id, uid);
{code}
And we also allow the following, which is a narrowing of the partition key as 
above
{code}
create table kurt.base3 (id int, uid int, clus1 int, clus2 int, data text, 
PRIMARY KEY ((id, uid), clus1, clus2));

CREATE MATERIALIZED VIEW kurt.mv4 AS SELECT * from kurt.base3 where id IS NOT 
NULL and uid IS NOT NULL and clus1 IS NOT NULL AND clus2 IS NOT NULL  PRIMARY 
KEY ((id, uid, clus1), clus2);
{code}

I _think_ supporting these cases is trivial and mostly already handled in the 
underlying MV write path, so we might be able to get away with just a simple 
change of [this 
condition|https://github.com/apache/cassandra/blob/83822d12d87dcb3aaad2b1e670e57ebef4ab1c36/src/java/org/apache/cassandra/cql3/statements/CreateViewStatement.java#L291].



> Allow MV with same PK but a different filtering
> -----------------------------------------------
>
>                 Key: CASSANDRA-13857
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13857
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Kurt Greaves
>
> We currently disallow creation of a view that has the exact same primary key 
> as the base where no clustering keys are present, however a potential use 
> case would be a view where part of the PK is filtered so as to have a subset 
> of data in the view which is faster for range queries. We actually currently 
> allow this, but only if you have a clustering key defined. If you only have a 
> partitioning key it's not possible.
> From the mailing list, the below example works:
> {code:java}
> CREATE TABLE users (
>   site_id int,
>   user_id text,
>   n int,
>   data set<frozen<text>>,
>   PRIMARY KEY ((site_id, user_id), n));
> user data is updated and read by PK and sometimes I have to fetch all user 
> for some specific site_id. It appeared that full scan by 
> token(site_id,user_id) filtered by WHERE site_id = <some id> works much 
> slower than unfiltered full scan on
> CREATE MATERIALIZED VIEW users_1 AS
> SELECT site_id, user_id, n, data
> FROM users
> WHERE site_id = 1 AND user_id IS NOT NULL AND n IS NOT NULL
> PRIMARY KEY ((site_id, user_id), n);
> {code}
> However the following does not:
> {code:java}
> CREATE TABLE users (
> site_id int,
> user_id text,
> data set<text>,
> PRIMARY KEY ((site_id, user_id)));
> CREATE MATERIALIZED VIEW users_1 AS
> SELECT site_id, user_id, data
> FROM users
> WHERE site_id = 1 AND user_id IS NOT NULL 
> PRIMARY KEY ((site_id, user_id));
> InvalidRequest: Error from server: code=2200 [Invalid query] message="No 
> columns are defined for Materialized View other than primary key"
> {code}
> This is because if the clustering key is empty we assume they've only defined 
> the primary key in the partition key and we haven't accounted for this use 
> case. 
> On that note, we also don't allow the following narrowing of the partition 
> key:
> {code}
> CREATE TABLE kurt.base (
>     id int,
>     uid text,
>     data text,
>     PRIMARY KEY (id, uid)
> ) 
> CREATE MATERIALIZED VIEW kurt.mv2 AS SELECT * from kurt.base where id IS NOT 
> NULL and uid='1' PRIMARY KEY ((id, uid));
> {code}
> But we do allow the following, which works because there is still a 
> clustering key, despite not changing the PK.
> {code}
> CREATE MATERIALIZED VIEW kurt.mv2 AS SELECT * from kurt.base where id IS NOT 
> NULL and uid='1' PRIMARY KEY (id, uid);
> {code}
> And we also allow the following, which is a narrowing of the partition key as 
> above
> {code}
> create table kurt.base3 (id int, uid int, clus1 int, clus2 int, data text, 
> PRIMARY KEY ((id, uid), clus1, clus2));
> CREATE MATERIALIZED VIEW kurt.mv4 AS SELECT * from kurt.base3 where id IS NOT 
> NULL and uid IS NOT NULL and clus1 IS NOT NULL AND clus2 IS NOT NULL  PRIMARY 
> KEY ((id, uid, clus1), clus2);
> {code}
> I _think_ supporting these cases is trivial and mostly already handled in the 
> underlying MV write path, so we might be able to get away with just a simple 
> change of [this 
> condition|https://github.com/apache/cassandra/blob/83822d12d87dcb3aaad2b1e670e57ebef4ab1c36/src/java/org/apache/cassandra/cql3/statements/CreateViewStatement.java#L291].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Updated] (CASSANDRA-13857) Allow MV with same PK but a different filtering

Reply via email to