Jon Haddad created CASSANDRA-11194:
--------------------------------------
Summary: materialized views - support explode() on collections
Key: CASSANDRA-11194
URL: https://issues.apache.org/jira/browse/CASSANDRA-11194
Project: Cassandra
Issue Type: New Feature
Reporter: Jon Haddad
I'm working on a database design to model a product catalog. Products can
belong to categories. Categories can belong to multiple sub categories (think
about Amazon's complex taxonomies).
My category table would look like this, giving me individual categories & their
parents:
{code}
CREATE TABLE category (
category_id uuid primary key,
name text,
parents set<uuid>
);
{code}
To get a list of all the children of a particular category, I need a table that
looks like the following:
{code}
CREATE TABLE categories_by_parent (
parent_id uuid,
category_id uuid,
name text,
primary key (parent_id, category_id)
);
{code}
The important thing to note here is that a single category can have multiple
parents.
I'd like to propose support for collections in materialized views via an
explode() function that would create 1 row per item in the collection. For
instance, I'll insert the following 3 rows (2 parents, 1 child) into the
category table:
{code}
insert into category (category_id, name, parents) values
(009fe0e1-5b09-4efc-a92d-c03720324a4f, 'Parent', null);
insert into category (category_id, name, parents) values
(1f2914de-0adf-4afc-b7ad-ddd8dc876ab1, 'Parent2', null);
insert into category (category_id, name, parents) values
(1f93bc07-9874-42a5-a7d1-b741dc9c509c, 'Child',
{009fe0e1-5b09-4efc-a92d-c03720324a4f, 1f2914de-0adf-4afc-b7ad-ddd8dc876ab1 });
cqlsh:test> select * from category;
category_id | name | parents
--------------------------------------+---------+------------------------------------------------------------------------------
009fe0e1-5b09-4efc-a92d-c03720324a4f | Parent |
null
1f2914de-0adf-4afc-b7ad-ddd8dc876ab1 | Parent2 |
null
1f93bc07-9874-42a5-a7d1-b741dc9c509c | Child |
{009fe0e1-5b09-4efc-a92d-c03720324a4f, 1f2914de-0adf-4afc-b7ad-ddd8dc876ab1}
(3 rows)
{code}
Given the following CQL to select the child category, utilizing an explode
function, I would expect to get back 2 rows, 1 for each parent:
{code}
select category_id, name, explode(parents) as parent_id from category where
category_id = 1f93bc07-9874-42a5-a7d1-b741dc9c509c;
category_id | name | parent_id
--------------------------------------+-------+--------------------------------------
1f93bc07-9874-42a5-a7d1-b741dc9c509c | Child |
009fe0e1-5b09-4efc-a92d-c03720324a4f
1f93bc07-9874-42a5-a7d1-b741dc9c509c | Child |
1f2914de-0adf-4afc-b7ad-ddd8dc876ab1
(2 rows)
{code}
This functionality would ideally apply to materialized views, since the ability
to control partitioning here would allow us to efficiently query our MV for all
categories belonging to a parent in a complex taxonomy.
{code}
CREATE MATERIALIZED VIEW categories_by_parent as
SELECT explode(parents) as parent_id,
category_id, name FROM category WHERE parents IS NOT NULL
{code}
The explode() function is available in Spark Dataframes and my proposed
function has the same behavior:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)