Benedict created CASSANDRA-7447:
-----------------------------------

             Summary: New Storage Format
                 Key: CASSANDRA-7447
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7447
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Benedict
            Assignee: Benedict
             Fix For: 3.0
         Attachments: ngcc-storage.odp

h2. Storage Format Proposal

C* has come a long way over the past few years, and unfortunately our storage 
format hasn't kept pace with the data models we are now encouraging people to 
utilise. This ticket proposes a collections of storage primitives that can be 
combined to serve these data models more optimally.

It would probably help to first state the data model at the most abstract 
level. We have a fixed three-tier structure: We have the partition key, the 
clustering columns, and the data columns. Each have their own characteristics 
and so require their own specialised treatment.

I should note that these changes will necessarily be delivered in stages, and 
that we will be making some assumptions about what the most useful features to 
support initially will be. Any features not supported will require sticking 
with the old format until we extend support to all C* functionality.

h3. Partition Key
* This really has two components: the partition, and the value. Although the 
partition is primarily used to distribute across nodes, it can also be used to 
optimise lookups for a given key within a node
* Generally partitioning is by hash, and for the moment I want to focus this 
ticket on the assumption that this is the case
* Given this, it makes sense to optimise our storage format to permit O(1) 
searching of a given partition. It may be possible to achieve this with little 
overhead based on the fact we store the hashes in order and know they are 
approximately randomly distributed, as this effectively forms an immutable 
contiguous split-ordered list (see Shalev/Shavit, or CASSANDRA-7282), so we 
only need to store an amount of data based on how imperfectly distributed the 
hashes are, or at worst a single value per block.
* This should completely obviate the need for a separate key-cache, which will 
be relegated to supporting the old storage format only

h3. Primary Key / Clustering Columns
* Given we have a hierarchical data model, I propose the use of a 
cache-oblivious trie
* The main advantage of the trie is that it is extremely compact and _supports 
optimally efficient merges with other tries_ so that we can support more 
efficient reads when multiple sstables are touched
* The trie will be preceded by a small amount of related data; the full 
partition key, a timestamp epoch (for offset-encoding timestamps) and any other 
partition level optimisation data, such as (potentially) a min/max timestamp to 
abort merges earlier
* Initially I propose to limit the trie to byte-order comparable data types 
only (the number of which we can expand through translations of the important 
types that are not currently)
* Crucially the trie will also encapsulate any range tombstones, so that these 
are merged early in the process and avoids re-iterating the same data
* Results in true bidirectional streaming without having to read entire range 
into memory

h3. Values
There are generally two approaches to storing rows of data: columnar, or 
row-oriented. The above two data structures can be combined with a value 
storage scheme that is based on either. However, given the current model we 
have of reading large 64Kb blocks for any read, I am inclined to focus on 
columnar support first, as this delivers order-of-magnitude benefits to those 
users with the correct workload, while for most workloads our 64Kb blocks are 
large enough to store row-oriented data in a column-oriented fashion without 
any performance degradation (I'm happy to consign very large row support to 
phase 2). 

Since we will most likely target both behaviours eventually, I am currently 
inclined to suggest that static columns, sets and maps be targeted for a 
row-oriented release, as they don't naturally fit in a columnar layout without 
secondary heap-blocks. This may be easier than delivering heap-blocks also, as 
it keeps both implementations relatively clean. This is certainly open to 
debate, and I have no doubt there will be conflicting opinions here.

Focusing on our columnar layout, the goals are:
* Support sparse and dense column storage
* Efficient compression of tombstones, timestamps, ttls, etc. into near-zero 
space based on offset/delta/bitmap encoding
* Normalisation of column names once per file
* Per-file block-layout index, defining how each block's data is encoded, so we 
can index directly within a block for dense fields (permitting more efficient 
page cache utilisation)
* Configurable grouping of fields per block, so that we can get closer to 
row-oriented or column-oriented performance, based on user goals

I have attached my slides from the ngcc for reference.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to