[
https://issues.apache.org/jira/browse/CASSANDRA-7643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078654#comment-14078654
]
Brandon Williams commented on CASSANDRA-7643:
---------------------------------------------
I am generally -1 on anything that encourages users to create that many column
families, or gives them the idea that doing so is a good idea.
> Cassandra Schema Template
> --------------------------
>
> Key: CASSANDRA-7643
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7643
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Cheng Ren
> Priority: Minor
> Attachments: patch.diff
>
>
> Schema changes are a performance pain point for us, because the schema is
> global information shared across the entire cluster. Our production Cassandra
> cluster consists of many sets of column families, totaling 1000 column
> families and 38301 columns, which sum up to 3.2MB.
> We have a data model where the primary key is split into two parts, K1 and K2.
> Let's say the cardinality of the set K1 is small. We also have a constraint:
> we frequently want to scan all rows that belong to a particular value of K1.
> In this case Cassandra offers two possible solutions:
> 1) Create a single table with a composite key (K1, K2)
> 2) Create a table per K1, with primary key as K2
> Option #1: There is only one table, but we lose the ability to easily scan
> all rows with K1 = X without paying the penalty of reading all rows in the
> table.
> Option #2: This gives us the freedom to scan only a particular value of K1,
> but it leads to a significant, potentially unbounded increase in the number
> of tables. If the size of the set K1 is relatively small, however, this is a
> feasible option with a cleaner data interface.
> An example of this data model is a set of merchants with products, where
> K1 = merchant_id and K2 = product_id. The number of merchants is still very
> small compared to the number of products.
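> As a rough sketch of the two options for the merchant example, the CQL below
> may help. It assumes that "composite key (K1, K2)" means a composite
> partition key; the table names and the "name" column are hypothetical and
> only serve as illustration.
> {code}
> -- Option #1: a single table with a composite partition key (K1, K2).
> CREATE TABLE products (
>     merchant_id text,
>     product_id text,
>     name text,
>     PRIMARY KEY ((merchant_id, product_id))
> );
> -- Option #2: one table per K1 value (here merchant "a"), keyed by K2 alone.
> CREATE TABLE products_merchant_a (
>     product_id text PRIMARY KEY,
>     name text
> );
> -- In option #2, scanning one merchant is a scan of that merchant's table.
> SELECT * FROM products_merchant_a;
> {code}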
> Option #2 is our solution, since the size of the set K1 is relatively small
> for us. But it also creates a fair number of tables per K1 that have exactly
> the same columns and metadata. Whenever we need to add or drop an attribute
> for all of our tables per K1, it puts a heavy load on the entire cluster, and
> all backend pipelines are affected or even have to be shut down to
> accommodate the schema change.
> To reduce the load of this kind of schema change, we came up with a new
> feature called "template". We can create a template, and then create tables
> with that template.
> For example:
> {code}
> create template template_table ( block_id text, PRIMARY KEY (block_id));
> create table table_a, table_b, table_c with template_table;
> {code}
> This reduces the time spent gossiping schema metadata. Moreover, when we need
> to add one more attribute for all of our merchants, we just need to alter the
> template:
> {code}
> alter template template_table add foo text;
> {code}
> This also alters table_a, table_b, and table_c.
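> For comparison, here is a sketch of the same change without templates, using
> standard CQL. Each table needs its own ALTER TABLE statement, and each
> statement is a separate schema change for the cluster to propagate.
> {code}
> ALTER TABLE table_a ADD foo text;
> ALTER TABLE table_b ADD foo text;
> ALTER TABLE table_c ADD foo text;
> {code}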
> We changed the system keyspace a bit to accommodate the template feature.
> schema_columnfamilies only stores the metadata of templates and non-templated
> column families, and schema_columns only stores the column info of templates
> and non-templated column families. We also added a new table in the system
> keyspace, schema_columnfamilies_templated, which manages the mapping between
> templates and templated column families.
> For example:
> schema_columnfamilies_templated:
> keyspace | columnfamily_name | template_name
> ---------+-------------------+----------------
> XXX      | table_a           | template_table
> XXX      | table_b           | template_table
> XXX      | table_c           | template_table
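> A table holding these mapping rows might look roughly like the sketch below.
> This is an illustration only: the column name keyspace_name, the primary key
> layout, and the query are assumptions, and the actual definition in
> patch.diff may differ.
> {code}
> -- Hypothetical shape of the mapping table (assumed key layout).
> CREATE TABLE schema_columnfamilies_templated (
>     keyspace_name text,
>     columnfamily_name text,
>     template_name text,
>     PRIMARY KEY (keyspace_name, columnfamily_name)
> );
> -- List every templated table in keyspace XXX with its template.
> SELECT columnfamily_name, template_name
>   FROM schema_columnfamilies_templated
>  WHERE keyspace_name = 'XXX';
> {code}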
> We already have some performance results from our 15-node cluster. Normally,
> creating 400 tables takes more than hours for all the migration stage tasks
> to complete, but creating 400 tables with templates takes just 1 to
> 2 seconds. It also works well for alter table.
> We believe what we're proposing here can be very useful for other people in
> the Cassandra community as well. Attached is our proposed patch for the
> template schema feature. Would the community consider accepting this patch
> into the main branch of the latest Cassandra? Or would you mind giving us
> feedback? Please let us know if you have any concerns or suggestions
> regarding the change.
--
This message was sent by Atlassian JIRA
(v6.2#6252)