Brian Hess created CASSANDRA-9415:
--------------------------------------

             Summary: Implicit use of Materialized Views on SELECT
                 Key: CASSANDRA-9415
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9415
             Project: Cassandra
          Issue Type: Improvement
            Reporter:  Brian Hess


CASSANDRA-6477 introduces Materialized Views.  This greatly simplifies the 
write path for the best-practice of "query tables".  But it does not simplify 
the read path as much as our users want/need.

We suggest to folks to create multiple copies of their base table optimized for 
certain queries - hence "query table".  For example, we may have a USER table 
with two type of queries: lookup by userid and lookup by email address.  We 
would recommend creating 2 tables USER_BY_USERID and USER_BY_EMAIL.  Both would 
have the exact same schema, with the same PRIMARY KEY columns, but different 
PARTITION KEY - the first would be USERID and the second would be EMAIL.

One complicating thing with this approach is that the application now needs to 
know that when it INSERT/UPDATE/DELETEs from the base table it needs to 
INSERT/UPDATE/DELETE from all of the query tables as well.  CASSANDRA-6477 
covers this nicely.

However, the other side of the coin is that the application needs to know which 
query table to leverage based on the selection criteria.  Using the example 
above, if the query has a predicate such as "WHERE userid = 'bhess'", then 
USERS_BY_USERID is the better table to use.  Similarly, when the predicate is 
"WHERE email = '[email protected]'", USERS_BY_EMAIL is appropriate.

On INSERT/UPDATE/DELETE, Materialized Views essentially give a single "name" to 
the collection of tables.  You do operations just on the base table.  It is 
very attractive for the SELECT side as well.  It would be very good to allow an 
application to simply do "SELECT * FROM users WHERE userid = 'bhess'" and have 
that query implicitly leverage the USERS_BY_USERID materialized view.

For additional use cases, especially analytics use cases like in Spark, this 
allows the Spark code to simply push down the query without having to know 
about all of the MVs that have been set up.  The system will route the query 
appropriately.  And if additional MVs are necessary to make a query run 
better/faster, then those MVs can be set up and Spark will implicitly leverage 
them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to