[cassandra] branch trunk updated: Added intro to data modeling section.

rustyrazorblade Wed, 04 Mar 2020 11:47:24 -0800

This is an automated email from the ASF dual-hosted git repository.

rustyrazorblade pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git



The following commit(s) were added to refs/heads/trunk by this push:
     new 4585bbd  Added intro to data modeling section.
4585bbd is described below

commit 4585bbd131aaab367ceee6833ca1e8287724a5b1
Author: dvohra <[email protected]>
AuthorDate: Wed Mar 4 11:37:56 2020 +0000

    Added intro to data modeling section.
    
    Patch by dvohra; Reviewed by Erick Ramirez and Jon Haddad for 
CASSANDRA-15481
---
 CHANGES.txt                                        |   1 +
 .../data_modeling/images/Figure_1_data_model.jpg   | Bin 0 -> 17469 bytes
 .../data_modeling/images/Figure_2_data_model.jpg   | Bin 0 -> 20925 bytes
 doc/source/data_modeling/index.rst                 |   1 +
 doc/source/data_modeling/intro.rst                 | 146 +++++++++++++++++++++
 5 files changed, 148 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index 072b7c3..a07ff91 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0-alpha4
+ * Add data modeling introduction (CASSANDRA-15481)
  * Improve the algorithmic token allocation in case racks = RF 
(CASSANDRA-15600)
  * Fix ConnectionTest.testAcquireReleaseOutbound (CASSANDRA-15308)
  * Include finalized pending sstables in preview repair (CASSANDRA-15553)
diff --git a/doc/source/data_modeling/images/Figure_1_data_model.jpg 
b/doc/source/data_modeling/images/Figure_1_data_model.jpg
new file mode 100644
index 0000000..a3b330e
Binary files /dev/null and 
b/doc/source/data_modeling/images/Figure_1_data_model.jpg differ
diff --git a/doc/source/data_modeling/images/Figure_2_data_model.jpg 
b/doc/source/data_modeling/images/Figure_2_data_model.jpg
new file mode 100644
index 0000000..7acdeac
Binary files /dev/null and 
b/doc/source/data_modeling/images/Figure_2_data_model.jpg differ
diff --git a/doc/source/data_modeling/index.rst 
b/doc/source/data_modeling/index.rst
index f01c92c..2f799dc 100644
--- a/doc/source/data_modeling/index.rst
+++ b/doc/source/data_modeling/index.rst
@@ -20,6 +20,7 @@ Data Modeling
 .. toctree::
    :maxdepth: 2
 
+   intro
    data_modeling_conceptual
    data_modeling_rdbms
    data_modeling_queries
diff --git a/doc/source/data_modeling/intro.rst 
b/doc/source/data_modeling/intro.rst
new file mode 100644
index 0000000..630a7d1
--- /dev/null
+++ b/doc/source/data_modeling/intro.rst
@@ -0,0 +1,146 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+..
+..     http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+Introduction
+============
+
+Apache Cassandra stores data in tables, with each table consisting of rows and 
columns. CQL (Cassandra Query Language) is used to query the data stored in 
tables. Apache Cassandra data model is based around and optimized for querying. 
Cassandra does not support relational data modeling intended for relational 
databases.
+
+What is Data Modeling?
+^^^^^^^^^^^^^^^^^^^^^^
+
+Data modeling is the process of identifying entities and their relationships. 
In relational databases, data is placed in normalized tables with foreign keys 
used to reference related data in other tables. Queries that the application 
will make are driven by the structure of the tables and related data are 
queried as table joins.
+
+In Cassandra, data modeling is query-driven. The data access patterns and 
application queries determine the structure and organization of data which then 
used to design the database tables.
+
+Data is modeled around specific queries. Queries are best designed to access a 
single table, which implies that all entities involved in a query must be in 
the same table to make data access (reads) very fast. Data is modeled to best 
suit a query or a set of queries. A table could have one or more entities as 
best suits a query. As entities do typically have relationships among them and 
queries could involve entities with relationships among them, a single entity 
may be included in multi [...]
+
+Query-driven modeling
+^^^^^^^^^^^^^^^^^^^^^
+
+Unlike a relational database model in which queries make use of table joins to 
get data from multiple tables, joins are not supported in Cassandra so all 
required fields (columns) must be grouped together in a single table. Since 
each query is backed by a table, data is duplicated across multiple tables in a 
process known as denormalization. Data duplication and a high write throughput 
are used to achieve a high read performance.
+
+Goals
+^^^^^
+
+The choice of the primary key and partition key is important to distribute 
data evenly across the cluster. Keeping the number of partitions read for a 
query to a minimum is also important because different partitions could be 
located on different nodes and the coordinator would need to send a request to 
each node adding to the request overhead and latency. Even if the different 
partitions involved in a query are on the same node, fewer partitions make for 
a more efficient query.
+
+Partitions
+^^^^^^^^^^
+
+Apache Cassandra is a distributed database that stores data across a cluster 
of nodes. A partition key is used to partition data among the nodes. Cassandra 
partitions data over the storage nodes using a variant of consistent hashing 
for data distribution. Hashing is a technique used to map data with which given 
a key, a hash function generates a hash value (or simply a hash) that is stored 
in a hash table. A partition key is generated from the first field of a primary 
key.   Data partiti [...]
+
+As an example of partitioning, consider table ``t`` in which ``id`` is the 
only field in the primary key.
+
+::
+
+ CREATE TABLE t (
+    id int,
+    k int,
+    v text,
+    PRIMARY KEY (id)
+ );
+
+The partition key is generated from the primary key ``id`` for data 
distribution across the nodes in a cluster. 
+
+Consider a variation of table ``t`` that has two fields constituting the 
primary key to make a composite or compound primary key.  
+
+::
+
+ CREATE TABLE t (
+    id int,
+    c text,
+    k int,
+    v text,
+    PRIMARY KEY (id,c)
+ );
+
+For the table ``t`` with a composite primary key the first field ``id`` is 
used to generate the partition key and the second field ``c`` is the clustering 
key used for sorting within a partition.  Using clustering keys to sort data 
makes retrieval of adjacent data more efficient.  
+
+In general,  the first field or component of a primary key is hashed to 
generate the partition key and the remaining fields or components are the 
clustering keys that are used to sort data within a partition. Partitioning 
data  improves the efficiency of reads and writes. The other fields that are 
not primary key fields may be indexed separately to further improve query 
performance. 
+
+The partition key could be generated from multiple fields if they are grouped 
as the first component of a primary key.  As another variation of the table 
``t``, consider a table with the first component of the primary key made of two 
fields grouped using parentheses.
+
+::
+ 
+ CREATE TABLE t (
+    id1 int,
+    id2 int,
+    c1 text,
+    c2 text
+    k int,
+    v text,
+    PRIMARY KEY ((id1,id2),c1,c2)
+ );
+
+For the preceding table ``t`` the first component of the primary key 
constituting fields ``id1`` and ``id2`` is used to generate the partition key 
and the rest of the fields ``c1`` and ``c2`` are the clustering keys used for 
sorting within a partition.  
+
+Comparing with Relational Data Model
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
+ 
+Relational databases store data in tables that have relations with other 
tables using foreign keys. A relational database’s approach to data modeling is 
table-centric. Queries must use table joins to get data from multiple tables 
that have a relation between them. Apache Cassandra does not have the concept 
of foreign keys or relational integrity. Apache Cassandra’s data model is based 
around designing efficient queries; queries that don’t involve multiple tables. 
Relational databases nor [...]
+
+Examples of Data Modeling
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As an example, a ``magazine`` data set consists of data for magazines with 
attributes such as magazine id, magazine name, publication frequency, 
publication date, and publisher.  A basic query (Q1) for magazine data is to 
list all the magazine names including their publication frequency. As not all 
data attributes are needed for Q1 the data model would only consist of ``id`` ( 
for partition key), magazine name and publication frequency as shown in Figure 
1.
+
+.. figure:: images/Figure_1_data_model.jpg
+
+Figure 1. Data Model for Q1
+
+Another query (Q2)  is to list all the magazine names by publisher.  For Q2 
the data model would consist of an additional attribute ``publisher`` for the 
partition key. The ``id`` would become the clustering key for sorting within a 
partition.   Data model for Q2 is illustrated in Figure 2.
+
+.. figure:: images/Figure_2_data_model.jpg
+
+Figure 2. Data Model for Q2
+
+Designing Schema
+^^^^^^^^^^^^^^^^ 
+
+After the conceptual data model has been created a schema may be  designed for 
a query. For Q1 the following schema may be used.
+
+::
+
+ CREATE TABLE magazine_name (id int PRIMARY KEY, name text, 
publicationFrequency text)
+
+For Q2 the schema definition would include a clustering key for sorting.
+
+::
+
+ CREATE TABLE magazine_publisher (publisher text,id int,name text, 
publicationFrequency text,  
+ PRIMARY KEY (publisher, id)) WITH CLUSTERING ORDER BY (id DESC)
+
+Data Model Analysis
+^^^^^^^^^^^^^^^^^^^
+
+The data model is a conceptual model that must be analyzed and optimized based 
on storage, capacity, redundancy and consistency.  A data model may need to be 
modified as a result of the analysis. Considerations or limitations that are 
used in data model analysis include:
+
+- Partition Size
+- Data Redundancy
+- Disk space
+- Lightweight Transactions (LWT)
+
+The two measures of partition size are the number of values in a partition and 
partition size on disk. Though requirements for these measures may vary based 
on the application a general guideline is to keep number of values per 
partition to below 100,000 and disk space per partition to below 100MB.
+
+Data redundancies as duplicate data in tables and multiple partition 
replicates are to be expected in the design of a data model , but nevertheless 
should be kept in consideration as a parameter to keep to the minimum. LWT 
transactions (compare-and-set, conditional update) could affect performance and 
queries using LWT should be kept to the minimum. 
+
+Using Materialized Views
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. warning::  Materialized views (MVs) are experimental in the latest (4.0) 
release.  
+
+Materialized views (MVs) could be used to implement multiple queries for a 
single table. A materialized view is a table built from data from another 
table, the base table, with new primary key and new properties. Changes to the 
base table data automatically add and update data in a MV.  Different queries 
may be implemented using a materialized view as an MV's primary key differs 
from the base table. Queries are optimized by the primary key definition.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[cassandra] branch trunk updated: Added intro to data modeling section.

Reply via email to