[jira] [Commented] (CASSANDRA-7643) Cassandra Schema Template

Brandon Williams (JIRA) Tue, 29 Jul 2014 17:03:00 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078654#comment-14078654
 ]


Brandon Williams commented on CASSANDRA-7643:
---------------------------------------------

I am generally -1 on anything that encourages or gives users the idea that that 
many column families is a good idea.

> Cassandra Schema Template 
> --------------------------
>
>                 Key: CASSANDRA-7643
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7643
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Cheng Ren
>            Priority: Minor
>         Attachments: patch.diff
>
>
> Schema changes are a performance pain point for us, since the schema is 
> global information shared across the entire cluster. Our production 
> Cassandra cluster consists of many sets of column families, totaling 1000 
> column families and 38301 columns, which add up to 3.2MB.
> We have a data model where the primary key is split into two parts, K1 and 
> K2. Let's say the cardinality of the set K1 is small. We also frequently 
> need to scan all rows that belong to a particular value of K1. 
> In this case Cassandra offers two possible solutions:
> 1) Create a single table with a composite key (K1, K2)
> 2) Create a table per K1, with primary key as K2
> In option #1, there is only one table, but we lose the ability to easily 
> scan all rows with K1 = X without paying the penalty of reading all rows 
> in the table.
> Option #2 gives us the freedom to scan only a particular value of K1. 
> However, it leads to a significant, potentially unbounded increase in the 
> number of tables. If the size of the set K1 is relatively small, though, 
> this is a feasible option with a cleaner data interface.
> An example of this data model is a set of merchants with products, where 
> K1 = merchant_id and K2 = product ID. The number of merchants is still very 
> small compared to the number of products. 
> Option #2 is our solution, since the size of the set K1 is relatively small 
> for us, but it also creates a fair number of per-K1 tables that have exactly 
> the same columns and metadata. Whenever we need to add or drop an attribute 
> for all of these tables, it puts a lot of load on the entire cluster, and 
> all backend pipelines are affected or even have to be shut down to 
> accommodate the schema change.
> To reduce the load of this kind of schema change, we came up with a new 
> feature called "template". We can create a template, and then create tables 
> with that template. 
> For example: 
> {code}
> create template template_table ( block_id text, PRIMARY KEY (block_id));
> create table table_a, table_b, table_c with template_table;
> {code}
> This allows us to reduce the time spent on metadata gossip. Moreover, when 
> we need to add one more attribute for all of our merchants, we just need to 
> alter the template:
> {code}
> alter template template_table add foo text;
> {code}
> which also alters table_a, table_b, and table_c.
> We changed the system keyspace a bit to accommodate the template feature:
> schema_columnfamilies only stores the metadata of templates and 
> non-templated column families.
> schema_columns only stores the column info of templates and non-templated 
> cfs.
> We also added a new table in the system keyspace called 
> schema_columnfamilies_templated, which manages the mapping between 
> templates and templated cfs.
> It looks like this:
> schema_columnfamilies_templated:
>  keyspace | columnfamily_name | template_name
> ----------+-------------------+----------------
>  XXX      | table_a           | template_table
>  XXX      | table_b           | template_table
>  XXX      | table_c           | template_table
> We already have some performance results from our 15-node cluster. Normally, 
> creating 400 tables takes hours for all the migration stage tasks to 
> complete, but creating 400 tables with templates takes just 1 to 2 seconds. 
> It also works well for alter table.  
> We believe what we're proposing here can be very useful for other people in 
> the Cassandra community as well. Attached is our proposed patch for the 
> template schema feature. Would the community consider accepting this patch 
> into the main branch of the latest Cassandra? Or would you mind giving us 
> feedback? Please let us know if you have any concerns or suggestions 
> regarding the change.
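
The indirection described in the quoted issue (column definitions stored once
per template, plus a small table mapping each templated column family to its
template) can be illustrated with a toy in-memory model. This is only a sketch
of the idea, not the attached patch or any Cassandra API: the class
SchemaRegistry, its method names, and the column_records_written counter are
all hypothetical, and the counter is a rough stand-in for schema migration
work rather than a measurement.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaRegistry:
    """Toy model of the schema tables described in the issue (hypothetical)."""

    # Template or plain table name -> {column name: CQL type}.
    # Plays the role of schema_columnfamilies / schema_columns.
    columns: dict = field(default_factory=dict)
    # Templated table name -> template name.
    # Plays the role of schema_columnfamilies_templated.
    templated: dict = field(default_factory=dict)
    # Number of column-definition records written so far.
    column_records_written: int = 0

    def create_template(self, name, cols):
        self.columns[name] = dict(cols)
        self.column_records_written += len(cols)

    def create_tables(self, names, template):
        if template not in self.columns:
            raise KeyError(f"unknown template: {template}")
        # Only a mapping row is stored per table; no columns are copied.
        for name in names:
            self.templated[name] = template

    def alter_template_add(self, template, column, cql_type):
        # One column record changes, however many tables use the template.
        self.columns[template][column] = cql_type
        self.column_records_written += 1

    def columns_of(self, table):
        # Templated tables resolve their columns through the mapping.
        return self.columns[self.templated.get(table, table)]


registry = SchemaRegistry()
registry.create_template("template_table", {"block_id": "text"})
registry.create_tables(["table_a", "table_b", "table_c"], "template_table")
registry.alter_template_add("template_table", "foo", "text")
print(registry.columns_of("table_b"))
```

In this model, adding a column touches one record regardless of how many
tables share the template, whereas altering each table separately would touch
one record per table. Whether that design is desirable in Cassandra itself is
the question under discussion in this issue.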



--
This message was sent by Atlassian JIRA
(v6.2#6252)
