[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2017-05-15 Thread sankalp kohli (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011512#comment-16011512 ]

sankalp kohli commented on CASSANDRA-7447:
--

Anyone interested in working on this? 

> New sstable format with support for columnar layout
> ---
>
> Key: CASSANDRA-7447
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7447
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Benedict
>Assignee: sankalp kohli
>  Labels: performance, storage
> Fix For: 4.x
>
> Attachments: ngcc-storage.odp, storage_format.pdf
>
>
> h2. Storage Format Proposal
> C* has come a long way over the past few years, and unfortunately our storage 
> format hasn't kept pace with the data models we are now encouraging people to 
> utilise. This ticket proposes a collection of storage primitives that can be 
> combined to serve these data models more optimally.
> It would probably help to first state the data model at the most abstract 
> level. We have a fixed three-tier structure: the partition key, the 
> clustering columns, and the data columns. Each has its own characteristics 
> and so requires its own specialised treatment.
> I should note that these changes will necessarily be delivered in stages, and 
> that we will be making some assumptions about what the most useful features 
> to support initially will be. Any features not supported will require 
> sticking with the old format until we extend support to all C* functionality.
> h3. Partition Key
> * This really has two components: the partition, and the value. Although the 
> partition is primarily used to distribute across nodes, it can also be used 
> to optimise lookups for a given key within a node
> * Generally partitioning is by hash, and for the moment I want to focus this 
> ticket on the assumption that this is the case
> * Given this, it makes sense to optimise our storage format to permit O(1) 
> searching for a given partition. It may be possible to achieve this with 
> little overhead, based on the fact that we store the hashes in order and know 
> they are approximately randomly distributed: this effectively forms an 
> immutable contiguous split-ordered list (see Shalev/Shavit, or 
> CASSANDRA-7282), so we only need to store an amount of data proportional to 
> how imperfectly distributed the hashes are, or at worst a single value per 
> block (a rough sketch of this lookup follows this list)
> * This should completely obviate the need for a separate key-cache, which 
> will be relegated to supporting the old storage format only
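
A minimal hypothetical sketch of the lookup referenced in the list above, assuming a sorted array of roughly uniformly distributed tokens; the class and method names are illustrative only and are not taken from this proposal:

{code:java}
// Hypothetical sketch: tokens[] holds the sorted partition tokens for one sstable.
// Because the hashes are approximately uniformly distributed, interpolating an
// index from the token value lands close to the true position, so the local
// correction loop is O(1) in expectation.
final class TokenIndexSketch
{
    private final long[] tokens;

    TokenIndexSketch(long[] sortedTokens) { this.tokens = sortedTokens; }

    int find(long token)
    {
        long min = tokens[0], max = tokens[tokens.length - 1];
        if (token < min || token > max)
            return -1;
        // Interpolate an expected index (ignoring long overflow for clarity).
        double fraction = max == min ? 0.0 : (double) (token - min) / (double) (max - min);
        int guess = (int) (fraction * (tokens.length - 1));
        // Local correction; with near-uniform hashes this walks only a few slots.
        while (guess > 0 && tokens[guess] > token) guess--;
        while (guess < tokens.length - 1 && tokens[guess + 1] <= token) guess++;
        return tokens[guess] == token ? guess : -1;
    }
}
{code}

With near-uniform hashes the interpolated guess should land within a few slots of the true index, which is what would let the per-partition index data stay tiny.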
> h3. Primary Key / Clustering Columns
> * Given we have a hierarchical data model, I propose the use of a 
> cache-oblivious trie
> * The main advantage of the trie is that it is extremely compact and 
> _supports optimally efficient merges with other tries_ so that we can support 
> more efficient reads when multiple sstables are touched
> * The trie will be preceded by a small amount of related data: the full 
> partition key, a timestamp epoch (for offset-encoding timestamps; a sketch of 
> this encoding follows the list below) and any other partition-level 
> optimisation data, such as (potentially) a min/max timestamp to abort merges earlier
> * Initially I propose to limit the trie to byte-order comparable data types 
> only (a set we can expand through translations of the important types that 
> are not currently byte-order comparable)
> * Crucially the trie will also encapsulate any range tombstones, so that 
> these are merged early in the process, avoiding re-iterating the same data
> * This results in true bidirectional streaming without having to read the 
> entire range into memory
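
As an illustration of the "timestamp epoch" item in the list above, a hedged sketch of offset-encoding timestamps against a per-partition epoch so each cell stores only a small delta; the varint helper and class name are assumptions for illustration, not the encoding actually specified here:

{code:java}
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: offset-encode cell timestamps against a per-partition "epoch"
// (the smallest timestamp), so each cell stores only a small non-negative delta.
final class TimestampEpochSketch
{
    static byte[] encode(long[] timestamps)
    {
        long epoch = Long.MAX_VALUE;
        for (long ts : timestamps)
            epoch = Math.min(epoch, ts);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarLong(out, epoch);          // the epoch, written once per partition
        for (long ts : timestamps)
            writeVarLong(out, ts - epoch); // small deltas instead of full 8-byte values
        return out.toByteArray();
    }

    // Minimal unsigned varint: 7 bits per byte, high bit set means "more bytes follow".
    private static void writeVarLong(ByteArrayOutputStream out, long v)
    {
        while ((v & ~0x7FL) != 0)
        {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }
}
{code}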
> h3. Values
> There are generally two approaches to storing rows of data: columnar, or 
> row-oriented. The above two data structures can be combined with a value 
> storage scheme that is based on either. However, given the current model we 
> have of reading large 64KB blocks for any read, I am inclined to focus on 
> columnar support first, as this delivers order-of-magnitude benefits to those 
> users with the correct workload, while for most workloads our 64KB blocks are 
> large enough to store row-oriented data in a column-oriented fashion without 
> any performance degradation (I'm happy to consign very large row support to 
> phase 2). 
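
To illustrate the distinction drawn in the paragraph above, a small hypothetical sketch of the same fixed-width rows laid out row-oriented versus column-oriented inside a single block; the class name and fixed-width encoding are assumptions for illustration only:

{code:java}
import java.nio.ByteBuffer;

// Hypothetical sketch: the same fixed-width rows serialised row-oriented versus
// column-oriented within one block. Reading a single column from the columnar
// block touches one contiguous run instead of striding across every row.
final class BlockLayoutSketch
{
    // rows[i][c] is the value of column c in row i (fixed-width longs for simplicity).
    static ByteBuffer rowOriented(long[][] rows, int columns)
    {
        ByteBuffer block = ByteBuffer.allocate(rows.length * columns * Long.BYTES);
        for (long[] row : rows)
            for (int c = 0; c < columns; c++)
                block.putLong(row[c]);      // row 0: col 0..n, row 1: col 0..n, ...
        block.flip();
        return block;
    }

    static ByteBuffer columnOriented(long[][] rows, int columns)
    {
        ByteBuffer block = ByteBuffer.allocate(rows.length * columns * Long.BYTES);
        for (int c = 0; c < columns; c++)
            for (long[] row : rows)
                block.putLong(row[c]);      // all of col 0, then all of col 1, ...
        block.flip();
        return block;
    }
}
{code}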
> Since we will most likely target both behaviours eventually, I am currently 
> inclined to suggest that static columns, sets and maps be targeted for a 
> row-oriented release, as they don't naturally fit in a columnar layout 
> without secondary heap-blocks. This may be easier than delivering heap-blocks 
> also, as it keeps both implementations relatively clean. This is certainly 
> open to debate, and I have no doubt there will be conflicting opinions here.
> Focusing on our columnar layout, the goals are:
> * Support sparse and dense column 

[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2015-02-10 Thread Benedict (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315250#comment-14315250 ]

Benedict commented on CASSANDRA-7447:
-

If the dimensionality of the map is fixed, I don't see why not. Certainly much 
closer, anyway.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2015-02-10 Thread Jeremy Hanna (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315182#comment-14315182 ]

Jeremy Hanna commented on CASSANDRA-7447:
-

So just to clarify: is this approach going to make the CQL types more native to 
the storage format, so that things like collections will be more performant? 
I've seen instances where the use of collections makes compaction throughput 
peg the CPU (the latter) and bottlenecks Cassandra. A more native approach 
would solve that, but that depends on how it's structured. If we're doing 
nestable collections, UDTs, and JSON in some form, does that negate the 
possibility of doing more native types, and hence make more complicated 
structures less performant?


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2015-01-27 Thread Evan Chan (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294160#comment-14294160 ]

Evan Chan commented on CASSANDRA-7447:
--

Hey guys,

I'm working on a project to implement a columnar database on top of Cassandra, 
so I found this really interesting. I have a repo comparing the regular CQL 
layout with a columnar layout, and I can confirm [~benedict]'s comments -- for 
certain queries I am seeing over two orders of magnitude speedup:

https://github.com/velvia/cassandra-gdelt

It would be super useful to have an option for a columnar layout, or to have 
that portion of the storage format be pluggable somehow.

I'm splitting out my research into a zero-serialization, high-performance 
vector binary format into a separate repo; drop me a note if you guys are 
interested.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-03 Thread Benedict (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119685#comment-14119685 ]

Benedict commented on CASSANDRA-7447:
-

[~jkrupan] when we say columnar, we mean the general columnar-database (not 
column family!) sense of the word. Either row-oriented or columnar storage can 
have clustering columns, but _without_ clustering columns columnar storage 
makes very little sense, as each partition will contain only one row (and so 
there is no optimisation to be made for accessing multiple rows for only a 
subset of columns). It could still be sensible if the data were sufficiently 
similar across partitions that a high compression impact could be had, and we 
will likely support some hybrid of row/columnar, which might be especially 
applicable for such use cases, but in general you should be thinking 
_timeseries data_ and the like for utilising columnar storage.

Delta encoding (once delivered) won't have any such requirement, no. It will 
operate over a single column's adjacent values, i.e. the values from each 
adjacent row for a given column.
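
A minimal sketch of what delta encoding over a single column's adjacent values could look like, purely to clarify the answer above; the zig-zag helper and class name are assumptions, not the ticket's actual encoding:

{code:java}
// Hypothetical sketch: delta-encode one column's values across adjacent rows.
// Each stored value is the (zig-zag encoded) difference from the previous row's value.
final class ColumnDeltaSketch
{
    static long[] deltas(long[] columnValues)
    {
        long[] out = new long[columnValues.length];
        long previous = 0;
        for (int i = 0; i < columnValues.length; i++)
        {
            out[i] = zigzag(columnValues[i] - previous); // small when adjacent rows are close
            previous = columnValues[i];
        }
        return out;
    }

    // Zig-zag maps small negative/positive deltas to small unsigned values.
    private static long zigzag(long v) { return (v << 1) ^ (v >> 63); }
}
{code}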


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-03 Thread Benedict (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119698#comment-14119698 ]

Benedict commented on CASSANDRA-7447:
-

[~slebresne], one thing I want to raise more explicitly (I alluded to, but want 
to make sure there's plenty of clarity of intent to open debate if necessary):

In the new storage format I want to move away from relying on overloading our 
comparators to achieve all of our functionality, and instead deliver slightly 
special-cased implementations supporting these features: namely collections and 
static columns, but likely also non-frozen UDT functionality. This will be 
built primarily on top of existing functionality, but the more explicit we are 
about managing these features at the storage level, the more efficient the end 
result is likely to be. In particular, collections will no longer live in the 
clustering area at all, and will be encoded entirely in the value area, but 
will be queryable within it. Static columns may be stored completely 
separately to support improved caching and faster/cheaper compaction.

The upshot is that, in the medium term, this will hopefully feed back into the 
rest of Cassandra as well, although the time horizon for that completing may be 
lengthy (since memtables and cells are unlikely to be replaced in time for 
3.0). Eventually I would like to see comparators return to simplicity.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-02 Thread Jack Krupansky (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118109#comment-14118109 ]

Jack Krupansky commented on CASSANDRA-7447:
---

Thanks, [~benedict]. So just to paraphrase: columnar is referring to the CQL 
use of clustering columns (multiple/many CQL rows per partition), and 
row-oriented is referring to a primary key consisting of only partition key 
columns with no clustering columns, so that row-oriented means only a single 
CQL row per partition, right?

One clarification: does the delta encoding require that each CQL row have only 
one column, so that each adjacent cell is for the same CQL column, or is delta 
encoding effective when the CQL row has a sequence of columns that map to a 
repeating sequence of adjacent cells, but the cells for a particular CQL column 
are never immediately adjacent in the partition?



[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Sylvain Lebresne (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117115#comment-14117115 ]

Sylvain Lebresne commented on CASSANDRA-7447:
-

I do not want to maintain the old file format and its related code for an 
indeterminate amount of time, and in fact I don't see a reason to risk not 
having the new format obsolete the old one right away. There are enough things 
to improve with the current file format that I'm absolutely convinced we can 
start with a version that supports everything and yet improves on the existing 
one, while still not limiting further, more specific optimizations. Given that 
this is the case, it seems obviously the right place to start to me, and so, 
unless the majority disagrees with me, I'm opposed to doing it any other way. 
If it's not the more dramatic changes first, that's fine by me; I do not think 
that prioritizing specific workloads first is a good idea, and the fact that 
Benedict wants to be sure to have his dramatic improvement in 3.0 is not a very 
strong argument in my book.

On the issue of byte-order comparisons, I'm not against having optimized 
components for it, but I'm a very strong -1 on not supporting them at all. They 
may not be widespread but they *are* used and have their uses. And the fact 
that we have this kind of debate when discussing the very first iteration of a 
new file format is what makes me convinced that we're doing it wrong.

I've understood what you prefer, but I disagree that this is the proper order 
for doing it; more precisely, I strongly think that we should start with a 
version that supports everything. I'm completely on board on the issue of 
having a more modular file format, and I'm good in general with the 
optimizations proposed here, as long as they are done as optimizations in a 
second step (which would also allow a more focused debate on the details of 
those more specialized optimizations in particular).

At this point, I'm interested in getting other opinions.




[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Benedict (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117186#comment-14117186 ]

Benedict commented on CASSANDRA-7447:
-

bq. At this point, I'm interested in getting other opinions.

Agreed. Although you seem to have brushed over my point about delivering this 
project incrementally. Picking easier subsections of functionality that are on 
an approximately linear line to delivery of all functionality makes the work 
more digestible, testable and debatable. It also happens to allow us to 
guarantee delivery of high-quality functionality in time for 3.0, even if it's 
possibly only a subset (though there's a good chance we can hit all of it), 
instead of delivering fairly average functionality that covers all users.

As I also tried to make clear, if your goal is to eliminate the old format 
entirely and this is widely supported, then once we have delivered the first 
tranche of components we can relatively trivially deliver a suboptimal addendum 
to support all of the old features (with only moderate benefit over the legacy 
format). However, even if the goal of eliminating the legacy format is decided 
upon, it seems best to leave this until much nearer GA, since if we can deliver 
the _full_ featureset before GA this is simply wasted effort.



[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Sylvain Lebresne (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117240#comment-14117240 ]

Sylvain Lebresne commented on CASSANDRA-7447:
-

bq. you seem to have brushed over my point about delivering this project 
incrementally

I'm completely on board about doing it incrementally. I'm just not sure what is 
intrinsically less incremental in what I'm suggesting. We both agree that we 
want a modular file format; we just disagree on which components we should 
start with. I'm saying we should start with generic components with no more 
ambition than to be no less efficient than what we have currently (but let's be 
clear that it's not very hard to improve on, at the least by avoiding repeating 
clustering columns, using a compact encoding for CQL column names, and using 
offset-encoding for timestamps, all of which are part of what you are 
suggesting for the first iteration anyway). You want to start with specialized 
components that are somewhat optimal for some workloads. I claim that there is 
a lot more to debate in what you're suggesting, and I certainly dispute the 
claim that what I'm suggesting is less digestible and testable (I would argue 
the contrary, in fact).

bq. It also happens to allow us to guarantee delivery of high quality 
functionality in time for 3.0, even if it's possibly only a subset (though a 
good chance we can hit all of it), instead of delivering fairly average 
functionality that covers all users

I do prefer delivering moderate improvements for all cases first and risking 
that the bigger but more restricted improvement slips, because this also means 
reducing the risk of having to maintain the old format longer (and, as said 
above, I do claim it's more digestible/incremental in practice). In fact, your 
comments so far have strongly suggested that having the file format cover 
everything was not even your second step.

bq. However even if the goal of eliminating the legacy format is decided upon, 
it seems best to leave this until much nearer GA as if we can deliver the full 
featureset before GA this is simply wasted effort.

And to be clear, I respectfully disagree (that it's best; I also don't see how 
it's wasted effort). I ask that we collectively decide both on 1) whether or 
not eliminating the old format is a goal (I'm very, very strongly opposed to 
this not being a goal at all, personally) and 2) whether we should start by 
covering everything first and optimize later for some cases, versus starting by 
covering specific subsets first and adding the more general part second.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Benedict (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117267#comment-14117267 ]

Benedict commented on CASSANDRA-7447:
-

It sounds like we're approaching agreement on our terms of disagreement, which 
is great :)

bq. I'm just not sure what is intrinsically less incremental in what I'm 
suggesting. 

The relatively simple (but still time-consuming) solution you seem to be 
advocating sounds like one we would obsolete entirely once the full featureset 
I want to deliver is completed, since it is less efficient. It's quite possible 
it wouldn't even translate in the way you imagine, since the indexing is very 
different in the new scheme and so the current approach would not be possible 
(we need to implement some kind of tree structure for search in clustering 
columns, at the very least, since we will not have any of the current KeyCache 
machinery, and limiting ourselves to byte-ordered types permits us to not only 
use but _assume_ especially efficient tries that are both more efficient on 
disk and algorithmically more efficient for merging on read). But, either way, 
this obsolescence seems to largely get in the way of the real end state we're 
aiming for, by introducing unnecessary work. I want incremental and 
non-superfluous deliverables en route.

As I stated above, I might be willing to aim for a row-oriented approach first 
(although incrementally delivered, so likely not supporting everything in one 
step, simply because this is easier and more digestible) which, once fully 
delivered, would support everything, except that the implementation I want to 
aim for would be for byte-ordered clustering columns only. If instead we want 
to support non-byte-ordered types, we either have to do more work to deliver an 
optimal result (i.e. a complex hybrid btree/trie variant), which is distinctly 
less incremental, or deliver a suboptimal result that only exists temporarily 
until optimised, and so is wasted effort. Either seems to run the risk of a 
reduction in the final featureset when done earlier in the process, simply by 
consuming precious time, so I prefer to implement these features much later, if 
at all (like I said, I strongly favour dropping support altogether), since at 
the last minute we can simply deliver an average/poor implementation if 
necessary. Especially given we know of no use cases outside of CQL types, to my 
knowledge, and we'll be eliminating those from the equation.

In summary, we need to agree on your points (1) and (2) in light of my new 
comments, but we also need (1a) and (2a):

(1a): Do we want to try to eliminate the old format early in the process, or 
should we aim our end state at eliminating it and only fill in the suboptimal 
solution we'd deliver earlier if we are running out of time?
(2a): Does 'everything' include non-byte-ordered types, or can we try to EOL 
them? Bearing in mind that the cost if this needs to be aborted is not 
dramatic, while the payoff is potentially huge, by simplifying many chunks of 
low-level codebase (and permitting pretty substantial computational benefits).



[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Aleksey Yeschenko (JIRA)

[ https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117286#comment-14117286 ]

Aleksey Yeschenko commented on CASSANDRA-7447:
--

It was my understanding that CASSANDRA-7743 was a temporary solution, there 
only until we migrate all the data to the new format. And we absolutely do not 
want to keep two or more formats indefinitely - getting rid of the old format 
*is* one of the goals.

As for the new format itself - yes, we do need it to support all the use cases 
(because we want to ultimately have just one format, with support for all the 
things).

The question is how to deliver it incrementally while making sure that it will 
eventually (and soon enough) support all the things.

I'd settle for listing all the features the initial format won't support in the 
beginning, and then detailed solutions for each of those features - but at that 
point we might as well code it up. That, and I'm not sure we can guarantee we 
can even make such a list. There are many obscure cases that nevertheless must 
be supported: empty values in primitive types, mixed dynamic/static 'thrift 
CFs', non-byte-order-comparable (including custom) types. Having an upgrade 
path for all of those is a strict requirement that you can't simply wish away 
(very unfortunately). It would be unacceptable to get close to 3.0 and only 
then realize that, oops, there is a use case we can't implement with the new 
format, and we have to stick with keeping both formats forever.

What would even be the logistics with a 'feature' not implemented by the new 
format? If the format is set to NEW in the yaml and someone writes an empty BB 
to an int column, what do we do on flush? If someone adds a 
non-byte-order-comparable column, or a collection, at what stage do we reject 
that?

That said, I think we are all in agreement that the current format can and 
should be replaced. And I agree that we should go for the big win here. I just 
think that the real big win is the one that would benefit all of our users (and 
"avoiding repeating clustering columns, using a compact encoding for CQL column 
names, and using offset-encoding for timestamps" benefits everyone; the first 
and third even benefit dense-table users).

This way we can deliver something strictly better than the old format to 
everyone, without the risk of getting stuck with two formats forever, and 
optimize for the special cases later.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117290#comment-14117290
 ] 

Benedict commented on CASSANDRA-7447:
-

To clarify, the only feature I've suggested not supporting is non-byte-order 
comparable types (so, custom types are the only real point here, if they aren't 
byte-order comparable - but we could offer an upgrade path that requires users 
to provide a function that translates their non-byte-ordered variant to a 
byte-ordered variant). I don't want to support the old format to maintain 
support for these; I want to drop support altogether to simplify some of our 
tight loop code paths, guarantee efficient merging on read, and reduce 
implementation complexity for efficient operation in the new file format.

Empty values in int columns are something we'll definitely support out of the 
box. However, later down the line I would like to roll out data types that do 
NOT permit empty values, so that we can optimise this away; either way it is not 
a huge issue, since the new format will encode all of this information in a 
bitmap.
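
As an illustrative sketch only (not the actual on-disk encoding, and all names here are hypothetical), a per-row bitmap could record which cells are present and which are present-but-empty, so an empty value written to an int column costs nothing in the value area:

{code:java}
import java.util.BitSet;

// Illustrative sketch: per-row bitmaps recording cell presence and emptiness,
// so the value area never has to distinguish "empty" from "absent".
final class RowPresenceBitmap
{
    private final BitSet present = new BitSet(); // cell exists at all
    private final BitSet empty   = new BitSet(); // cell exists but has a zero-length value

    void addCell(int columnIndex, int valueLength)
    {
        present.set(columnIndex);
        if (valueLength == 0)
            empty.set(columnIndex);
    }

    boolean isPresent(int columnIndex) { return present.get(columnIndex); }
    boolean isEmpty(int columnIndex)   { return present.get(columnIndex) && empty.get(columnIndex); }
}
{code}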



[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117293#comment-14117293
 ] 

Benedict commented on CASSANDRA-7447:
-

Also, CASSANDRA-7743 is mostly unrelated to this, I'm pretty sure.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117295#comment-14117295
 ] 

Aleksey Yeschenko commented on CASSANDRA-7447:
--

bq. Also, CASSANDRA-7743 is mostly unrelated to this I'm pretty sure

Meant CASSANDRA-7443


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117301#comment-14117301
 ] 

Benedict commented on CASSANDRA-7447:
-

To clarify further, here is my complete list of the distinct feature sets we can deliver:

* container format, which places the relevant blocks of the following components 
optimally, using a complex strategy I will outline
* partition index
** hashed (phase 1)
** byte-ordered (phase 2)
* clustering column indexes
** byte-ordered
** [non-byte-ordered, basic]
** [non-byte-ordered, btree]
** [non-byte-ordered/byte-ordered hybrid]
* value encoding
** row oriented
** column oriented
*** some extended variants on functionality of column oriented 
** row/column hybrid (we may possibly retire this goal now; given the latest 
approach it may not be sufficiently worth delivering)
** efficient collection and complex UDT encoding/decoding (specifically, these 
will NOT inflate at the clustering column index level)

All other features not mentioned are assumed to be supported, and if we've 
missed something we'll have to deal with it somehow. If there's some kind of 
nook and cranny I'm not currently aware of that will bite us later, please air 
it now for discussion, as it will at least help design.

Anyway, we *could* deliver some kind of [non-byte-ordered, basic] implementation 
the first time around, which would deliver most functionality, just very 
suboptimally. If we want to deliver row-oriented first and want to support 
everything, we need to also deliver efficient collection/UDT encoding/decoding. 

However, delivering a non-trie-based index means we inflate the amount of code 
dramatically, especially on read/merge/search, and introduce a great deal more 
conditional complexity, since we cannot assume byte-order comparability. I would 
strongly prefer to avoid this, given it yields little benefit and is avoidable 
at this crossroads.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117303#comment-14117303
 ] 

Aleksey Yeschenko commented on CASSANDRA-7447:
--

bq. we could offer an upgrade path that requires users to provide a function 
that translates their non-byte-ordered variant to a byte-ordered variant

I'm not sure if that counts as 'offer an upgrade path' - we can't guarantee 
that every type can be made byte-order-comparable. But forget custom types. How 
can you make, for example, a (frozen) set of blobs, on its own or inside a UDT, 
byte-order-comparable? And those are already a thing in 2.1.
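
A small demonstration of the difficulty, under an assumed, simplified collection serialization (element count, then length-prefixed elements - a simplification of the real format): the leading count makes the serialized bytes disagree with element-wise comparison, which is why a frozen set of blobs is not byte-order comparable as serialized.

{code:java}
import java.nio.ByteBuffer;

// Illustrative only: shows why a naively serialized frozen set is not
// byte-order comparable. The layout here (count, then length + bytes per
// element) is an assumed simplification, not the exact on-disk format.
public final class SetOrderingDemo
{
    static ByteBuffer serializeSet(byte[]... elements)
    {
        int size = 4;
        for (byte[] e : elements)
            size += 4 + e.length;
        ByteBuffer out = ByteBuffer.allocate(size);
        out.putInt(elements.length);   // element count comes first
        for (byte[] e : elements)
        {
            out.putInt(e.length);      // per-element length prefix
            out.put(e);
        }
        out.flip();
        return out;
    }

    public static void main(String[] args)
    {
        // Semantically {0x01, 0x02} < {0xFF} (first elements compare 0x01 < 0xFF),
        // but the leading element count makes {0xFF} sort first byte-wise.
        ByteBuffer a = serializeSet(new byte[]{ (byte) 0xFF });
        ByteBuffer b = serializeSet(new byte[]{ 0x01 }, new byte[]{ 0x02 });
        System.out.println(a.compareTo(b) < 0); // true: byte order disagrees with set order
    }
}
{code}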


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117307#comment-14117307
 ] 

Sylvain Lebresne commented on CASSANDRA-7447:
-

bq. To clarify, the only feature I've suggested not supporting is 
non-byte-order comparable types

And I reiterate my strong -1 on that at this point. It's also not just custom 
types imo: we allow UDTs in comparators that could include collections, and we 
will allow serialized collections more generally. And while we might be able 
to make those byte-order comparable at some point, I suspect it's not trivial 
at all. Also, to the best of my knowledge, we have no clear, well-defined and 
tested way to convert all CQL types to byte-order, and for some (decimal, 
varint, uuid (whose comparison depends on the uuid type)), it's far from 
trivial. So outside of the fact that, again, I'm -1 on dropping support for 
non-byte-order types altogether, it also seems to me that covering all but 
custom types for 3.0 is a fairly risky bet. I somewhat doubt that starting with 
a more general, if less efficient, indexing for clustering columns will be a 
big waste of effort.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117309#comment-14117309
 ] 

Benedict commented on CASSANDRA-7447:
-

OK, you've convinced me. We certainly could make collections byte-order 
comparable, but you're right that it would be unpleasant to do so.

However, using these as clustering columns is already inherently inefficient, so 
I propose we stick to delivering a relatively average-to-poor implementation 
which is dropped in automatically only when the types aren't all byte-order 
comparable.

I'm pretty sure we can make decimal, varint and each uuid type byte-order 
comparable.
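
As a sketch of the sort of translation meant here (not Cassandra's actual encoding): a fixed-width signed integer becomes byte-order comparable by storing it big-endian with the sign bit flipped, so unsigned lexicographic comparison of the bytes matches numeric order; variable-length types such as varint and decimal need a more involved, length- and exponent-aware scheme.

{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch, not Cassandra's actual encoding: a byte-order-comparable
// encoding of a signed long via big-endian bytes with the sign bit flipped.
public final class ByteOrderedLong
{
    static byte[] encode(long value)
    {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putLong(value ^ Long.MIN_VALUE); // flip the sign bit
        return buf.array();
    }

    static int compareUnsigned(byte[] a, byte[] b)
    {
        for (int i = 0; i < 8; i++)
        {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0)
                return cmp;
        }
        return 0;
    }

    public static void main(String[] args)
    {
        // Numeric order -5 < 3 is preserved by unsigned comparison of the encodings.
        System.out.println(compareUnsigned(encode(-5L), encode(3L)) < 0); // true
    }
}
{code}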


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117313#comment-14117313
 ] 

Benedict commented on CASSANDRA-7447:
-

[~iamaleksey]: CASSANDRA-7443 is unlikely to be removed once we obsolete the 
old file format. It only continues to help us in preparing for the 4.0 
generation, which may be even more dramatic than 3.0.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117318#comment-14117318
 ] 

Sylvain Lebresne commented on CASSANDRA-7447:
-

bq. I propose we stick to delivering a relatively average-to-poor 
implementation which is dropped in automatically only when the types aren't all 
byte-order comparable

I don't have a problem with that. I merely don't want to maintain 2 file 
formats, and so I want to prioritize having some indexing that doesn't assume 
byte order. Happy to keep it relatively simple so we have time to add the 
byte-order index part too (I don't think adding a fallback general indexing will 
delay us much if we keep it simple (we can always optimize it later anyway if 
we want), and I'm happy to say that we'll hold 3.0 for the byte-ordered part if 
we're slightly late). Another advantage I see is that we'll feel a little 
bit less in a rush to deliver conversions to byte-order for all types 
(typically, maybe decimal will prove challenging and ugly, but it's also not 
all that often used in clustering columns, so it's totally ok if we don't have a 
conversion for 3.0).
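
Purely for illustration of what a "general, if less efficient" clustering index could look like (all names here are assumptions, not the actual design): a sorted list of clustering prefixes searched through the type's own comparator, rather than a trie that requires byte-order-comparable keys.

{code:java}
import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of a comparator-based fallback clustering index.
final class ComparatorBasedClusteringIndex
{
    private final List<ByteBuffer> sortedClusteringPrefixes;   // one entry per indexed block
    private final Comparator<ByteBuffer> clusteringComparator;  // supplied by the type system

    ComparatorBasedClusteringIndex(List<ByteBuffer> sortedClusteringPrefixes,
                                   Comparator<ByteBuffer> clusteringComparator)
    {
        this.sortedClusteringPrefixes = sortedClusteringPrefixes;
        this.clusteringComparator = clusteringComparator;
    }

    /** Returns the index of the last block whose starting prefix is <= target,
     *  i.e. the block that may contain the target clustering. */
    int floorBlock(ByteBuffer target)
    {
        int lo = 0, hi = sortedClusteringPrefixes.size() - 1, result = 0;
        while (lo <= hi)
        {
            int mid = (lo + hi) >>> 1;
            if (clusteringComparator.compare(sortedClusteringPrefixes.get(mid), target) <= 0)
            {
                result = mid;
                lo = mid + 1;
            }
            else
            {
                hi = mid - 1;
            }
        }
        return result;
    }
}
{code}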


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117331#comment-14117331
 ] 

Benedict commented on CASSANDRA-7447:
-

OK, after a bit more consideration I'm willing to accept that tack.

We still need to get wider discussion about whether we want to prioritise 
column-oriented (which is likely to have the largest performance implications) 
over row-oriented. The only major reason this might be beneficial is that it 
actually inherently supports fewer features, so we don't need to deliver all of 
the complexity out of the box (so we can deliver it faster). I don't have a 
strong position either way, however. Assuming we deliver row-oriented in time, 
we can likely drop column-oriented quite quickly afterwards as well.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117373#comment-14117373
 ] 

Jack Krupansky commented on CASSANDRA-7447:
---

Can you guys provide two very simple table definitions and corresponding CQL 
queries that exemplify row vs. columnar storage and processing optimality?

IOW, the two key test cases that would confirm the extent to which the goals of 
this issue are met, although that might include a narrative about how much 
updates and deletions impact storage and performance as well.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117375#comment-14117375
 ] 

Jack Krupansky commented on CASSANDRA-7447:
---

I'm also interested in the proposal's impact on the storage and performance of:

1. Secondary indexes - either native Cassandra, manually maintained index 
tables, or even DSE/Solr indexing. Would it make them faster, more compact, 
less needed, or no net impact?

2. Filtering - Would it enable more/faster filtering, discourage it, or no net 
change?



[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-09-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117876#comment-14117876
 ] 

Benedict commented on CASSANDRA-7447:
-

Columnar storage is best for wide tables storing wide rows - i.e., tables with 
partitions that span multiple pages on disk and with many columns defined, not 
all of which are accessed simultaneously (e.g. a table of various stock 
pricing, volume and related data). It is *most* helpful when this kind of data 
is accessed for analytics (especially, say, Spark performing token-aware 
aggregations), but it offers advantages for simple queries also, simply by 
reducing the number of pages that need to be read. The right kinds of data can 
also require substantially less disk space, as compression operating over a 
more uniform dataset (i.e. one column) is especially effective, as is delta 
encoding in some scenarios.
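
To make the compression/delta-encoding point concrete, here is a rough sketch of 
delta-encoding a sorted column of long timestamps; the class and helper names 
are invented for illustration and are not part of any proposed API:

{code:java}
import java.nio.ByteBuffer;

/** Hypothetical sketch: delta-encode a sorted column of long timestamps. */
public final class DeltaEncodingSketch
{
    /** Stores each value as the (usually small) gap from its predecessor. */
    public static ByteBuffer encode(long[] sortedTimestamps)
    {
        ByteBuffer out = ByteBuffer.allocate(sortedTimestamps.length * 10);
        long previous = 0;
        for (long ts : sortedTimestamps)
        {
            writeVarLong(out, ts - previous); // small deltas take few bytes
            previous = ts;
        }
        out.flip();
        return out;
    }

    /** Unsigned LEB128-style varint; deltas of a sorted column are non-negative. */
    private static void writeVarLong(ByteBuffer out, long value)
    {
        while ((value & ~0x7FL) != 0)
        {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    public static void main(String[] args)
    {
        long[] column = { 1_407_456_000_000L, 1_407_456_000_250L, 1_407_456_001_000L };
        // ~10 bytes instead of the 24 a fixed-width encoding would need
        System.out.println("encoded bytes: " + encode(column).remaining());
    }
}
{code}

The same idea applies to any column whose values resemble their neighbours, 
which is exactly what a per-column layout tends to produce.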



[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-31 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116806#comment-14116806
 ] 

Sylvain Lebresne commented on CASSANDRA-7447:
-

So we disagree. I'll try to detail a bit more clearly what I disagree with 
and explain how I would rather proceed. Others may then chime in with their 
opinions (and if it turns out my objections are mine alone, I'll happily yield 
to the majority).

You propose as the ultimate goal to have not just one static file format, but 
rather a somewhat modular file format with specialized variants to handle some 
types of workload/tables. I have no problem with that ultimate goal (provided 
things are well enough thought out that it's not a completely unmaintainable 
mess for no big benefit, but I suppose we agree on that).

Where I disagree, however, is how we proceed to get there and where we start. My 
understanding is that you're suggesting a first version that is pretty 
specialized:
* it would only work for tables whose comparator is byte-ordered. Today, that's 
text/ascii, blob and inet, and that's it. There's a plan (CASSANDRA-6936) for 
transforming values on the fly for the other types to make them byte-ordered, 
but that's still a 2nd step, and one that is not all that clear for all types.
* it would exclude static columns, sets and maps. I suspect this would also 
exclude nested collections and UDTs done right, which we plan to introduce in 
3.0 too.
Now, I understand that you're not saying that the file format would never lift 
those limitations, but they would still be limitations of the first 
iteration. And there is no clear plan on when the rest will be delivered. In 
particular, it means we'll have to maintain the old file format for an 
indeterminate amount of time (some of your comments even suggest that supporting 
all tables will not be a very high priority at all: "not deliver support for 
this immediately, but possibly not to deliver at all until we receive a lot of 
evidence it's necessary", or "3.0 storage engine won't be universal for a while 
(maybe never for thrift)" in CASSANDRA-6276).

What I would suggest instead would be this:
# Deliver that modular file format, but with, for each component, something that 
can handle everything we have. I claim that it's not too hard to do that and 
still have this first version be a net win over the current format (because 
that's a low bar).
# Optimize specific components for specific table structures/workloads.

Doing it in this order has the following advantages imo:
* we stop relying on the old format right away. We don't have to maintain 
legacy code for an indeterminate time.
* we can discuss the details of each specific optimization more in isolation.
* we can evaluate each optimization from a much saner base.

bq. This is because supporting non-byte-ordered types not only severely 
complicates the delivery of the intra-partition indexing

I'm not sure why that is. We have an existing intra-partition indexing that 
perfectly supports non-byte-ordered types, and it's pretty simple (better, we 
even have the code already). Can we do better? Certainly, but that doesn't mean 
that we *have to* change everything right away. Making the file format more 
modular and making it more compact (by at least, but not limited to, avoiding 
the repetition of the clustering columns for a given CQL row, using a compact 
encoding of CQL column names and using offset-encoding for timestamps) would 
certainly be a good start in my book, one that would not be too big a feature, 
and one on which we can build.
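
As a purely illustrative aside on one of those cheap wins, a compact encoding of 
CQL column names can be as simple as a per-sstable dictionary mapping each name 
to a small integer id; the class below is a hypothetical sketch, not code from 
this ticket:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a per-sstable column-name dictionary. */
public final class ColumnNameDictionarySketch
{
    private final Map<String, Integer> idsByName = new HashMap<>();
    private final List<String> namesById = new ArrayList<>();

    /** Returns the existing id for a column name, assigning the next id on first sight. */
    public int idFor(String columnName)
    {
        return idsByName.computeIfAbsent(columnName, name -> {
            namesById.add(name);
            return namesById.size() - 1;
        });
    }

    /** Reverse lookup, used when turning stored cells back into named columns. */
    public String nameFor(int id)
    {
        return namesById.get(id);
    }

    public static void main(String[] args)
    {
        ColumnNameDictionarySketch dict = new ColumnNameDictionarySketch();
        int priceId = dict.idFor("price");   // 0
        int volumeId = dict.idFor("volume"); // 1
        // Each stored cell now carries a 1-2 byte id instead of the repeated UTF-8 name.
        System.out.println(dict.nameFor(priceId) + "=" + priceId + ", " + dict.nameFor(volumeId) + "=" + volumeId);
    }
}
{code}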


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-31 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116811#comment-14116811
 ] 

Benedict commented on CASSANDRA-7447:
-

I think we're conflating here what will make it into GA and what will be 
delivered first. I want to deliver not a specialised format, but a lightly 
modular format from which comparatively small pieces of work allow us to expand 
its power and applicability, and I want to deliver the most common and 
interesting (and most demonstrably superior) uses first. I'm happy to debate 
what we aim to release simultaneously with a GA release. My goal is to deliver, 
in order (and hopefully all before GA):

# byte-ordered composite column indexing - row/columnar hybrid (i.e. a 
sub-square of a given grid in any page on disk), pages packed in row-oriented 
fashion
# byte-ordered composite column indexing - columnar (i.e. configurable one or 
more columns per page), pages packed in column-oriented fashion
# byte-ordered composite column indexing - row-oriented, with nesting of 
statics, collections, UDTs etc.

Note that supporting statics, collections, UDTs etc. with columnar is as simple 
as splitting them out into a row-oriented (old or new) format, so it does not 
inherently limit the scope of their applicability.

It's quite possible we can deliver all of this for GA of 3.0; however, I want to 
focus on the biggest wins first. Now, whilst one of the main driving forces for 
the new format is how inherently inefficient the current indexing is - the 
indexing you mention we already have and could use - we could relatively easily 
migrate it to the new format if we wanted, or at least to a version with some of 
the new algorithmic benefits but similar overall algorithmic and space 
complexity. However, I don't see a tremendous benefit to this over leaving 
users with the old format until we can deliver the maximal algorithmic and 
space complexity benefits of an index that supports both non-byte-ordered and 
byte-ordered data simultaneously and optimally.

More importantly, as I previously stated and you do not appear to have 
responded to directly, there are major benefits to eliminating non-byte-ordered 
comparators from the codebase entirely, and there is limited evidence that they 
are widespread outside of the C* types we support that do not currently have a 
byte-ordered representation. The work to translate these is comparatively 
straightforward, although not trivial, and yields its own benefit of relying 
on more efficient data structures and code paths to improve efficiency for 
anyone using these types, regardless of our support for non-byte-ordered types. 
However, once we've done this, given the limited evidence of their wide use 
outside of these types, I have a very strong preference for creating some 
pressure to eliminate them entirely from the ecosystem, only letting up if the 
pushback is sufficient.

So for this reason alone I prefer not to deliver indexing that supports them.

However, if that is not generally supported as a position, we could work in 
tandem with other work to produce a simpler indexing component that has no 
better algorithmic or space complexity than the current indexing scheme, and 
support it. But I would at the very least want to see it turned off by default, 
and would rather allocate resources to it only once we know we're ahead of the 
game for the big release points of 3.0.

If, however, we wanted to support non-byte-ordered comparators optimally and 
generally (and I can't quite make out what your position on this is), we need to 
focus on a hybrid data structure, which at the moment appears to be of 
significantly greater complexity than either the considerably more optimal 
byte-ordered data structure or a regular non-byte-ordered (only) structure. I 
definitely want to leave work on something like this until after we've 
delivered our big wins, and preferably until we have strong evidence it will be 
widely useful.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-31 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14117047#comment-14117047
 ] 

Benedict commented on CASSANDRA-7447:
-

Since the prior comment was very verbose, I want to clarify my goals and 
approach a little independently.

There are a number of major puzzle pieces that the end state of this general 
push entails, and we want to deliver it _incrementally_. As such, my main desire 
for a plan of attack is to deliver the minimal components necessary to provide 
some of the true final-state functionality, i.e. components that are themselves 
complete, but also components that are well placed to be built out from (or 
adjacent to) in order to fill in the remaining end-state goals. My desire is to 
plan the puzzle pieces so that delivering a majority of functionality before GA 
is possible (but probably not guaranteed), and so that there are many fully 
functional pre-end-states en route. This is optimally done without producing 
redundant work.

I also want to slightly sharpen and narrow the focus of the storage engine so 
that it can not only be more efficient, but can also be maintained and expanded 
more efficiently. These two goals are strongly aided by dropping support for 
non-byte-ordered clustering types. A last-minute, pressure-induced suboptimal 
solution for these use cases would NOT be a tremendously complex task, however, 
so there doesn't seem to be much lost in delivering this later.

I've been considering various modifications to the approach outlined so far, 
and there are some fairly significant (though not earth-shifting) changes I 
will outline in a follow-up comment, when I'm in a better position to do so 
(hopefully very soon). It's quite possible that, given the changes I intend to 
outline, I would be quite comfortable delivering a row-oriented approach first. 
However, even this would be delivered _incrementally_, and would most likely not 
support collections and some of the other complex features we're introducing 
with UDTs in 3.0 in the first pass. This would introduce approximately zero 
extra work to deliver it in stages, but permits much greater QA work and easier 
review, integration and development.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14116630#comment-14116630
 ] 

Benedict commented on CASSANDRA-7447:
-

The goal of the new file design is that it is sufficiently modular that 
modifying it to support a row-oriented approach is not very hard. However, the 
magnitude of the performance differential will not be as dramatic, so I prefer 
to deliver the dramatic benefits first.

Supporting ALL current users is going to be a bit more involved, as I have cut 
support for non-byte-ordered types from the scope of the initial delivery 
(excluding those we can transform to byte-ordered, which I'm reasonably 
confident includes all CQL types). This is because supporting non-byte-ordered 
types not only severely complicates the delivery of the intra-partition 
indexing, it also makes it suboptimal once delivered. My preference would 
certainly be not to deliver support for this immediately, and possibly not to 
deliver it at all until we receive a lot of evidence that it's necessary, from 
users asking for it when looking to upgrade, since encouraging its obsolescence 
is a positive step for simplification and for maintaining future performance 
characteristics. Either way, it should not be the default due to its negative 
implications.

In the next week or so I should be posting the outline of the next-gen 
container type, which will loosely be composed of the three components 
outlined above, but defined in such a way that any of the components can be 
switched to yield different characteristics or support different workloads. 

In summary, we disagree on multiple levels; however, delivering columnar vs. 
row-oriented is not the most important one. Delivering a row-oriented solution 
will use largely the same components as a columnar approach and, assuming there 
is time before release, it's quite achievable to deliver both simultaneously. 
Supporting all current use cases in the new file format is a great deal more 
involved, however. There is certainly a great deal of benefit, either way, to 
fleshing out more of the details of the new-and-improved approaches, since the 
subtleties will not be as clear to us in advance as they are when thinking 
about/discussing the existing state of affairs. So simply rewriting the 
existing state of affairs seems to me to run the slight risk of yielding a 
piece of work that is not trivially adapted to support the more optimal use 
cases.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-29 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14115471#comment-14115471
 ] 

Sylvain Lebresne commented on CASSANDRA-7447:
-

bq. Any features not supported will require sticking with the old format until 
we extend support to all C* functionality.

I'm not a huge fan of that part, tbh. I would have a strong preference for 
introducing a file format that is better than our current one (not very hard) 
but is generic enough to cover everything, even if this means that it's not as 
optimized for some workloads as it could possibly be. Which probably means a 
row-oriented approach first. We can then optimize some special cases later.

One reason is that I'd prefer not having to keep/maintain the existing file 
format for multiple versions, but more generally, it feels saner to me to get 
one format reasonably performant first, and then add special cases/variants. 
This would also delay the debate on how much we're willing to specialise the 
sstable format for different workloads, a debate on which I'm not sure we all 
have the same opinion, and one that is currently somewhat clouded by the fact 
that the existing format is really rather inefficient.




[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-08 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090621#comment-14090621
 ] 

Robert Stupp commented on CASSANDRA-7447:
-

Some notes from my side (I do not have in-depth knowledge of how sstables work, 
so maybe some points are foolish):
* CASSANDRA-7443 is good to clean up the code base and should be done first
* Using 32-bit instead of 64-bit pointers could also save some space. Two 
variants:
** byte-granular 32-bit pointers for files up to 4 GB, or
** compressed 32-bit pointers (basically what Java's CompressedOops does), if 
we know that blocks can only begin at certain offsets - this would move the 
32-bit barrier to 64 GB files for 16-byte boundaries (see the sketch at the end 
of this comment)
* Trie + byte-ordered types: would this mean doing some special serialization, 
e.g. for timeuuid, to make them binary comparable?
* If one partition only contains one row, plain row-oriented storage seems to 
be more efficient. Is this what the small-partition layout is meant for?
* Column names (CQL): I'd prefer to extend the table definition schema with an 
integer column id and use that. It could save lots of String.hashCode/equals() 
calls - even if the column id is also used in the native protocol. (I think 
this was discussed elsewhere.)
* Bikeshed: Is the term sstable still correct?

I didn't catch the point of why only maps and sets don't naturally fit into the 
columnar format but lists, strings and blobs do. Or is it just because of their 
mean serialized size?

If I understood the columnar concept correctly, it works best if a whole 
partition fits in one (64k) block. The more additional (heap) blocks one 
partition needs, the more efficient the row-oriented format becomes. Correct?
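
A rough, purely illustrative sketch of the compressed-pointer variant mentioned 
in the list above, assuming 16-byte block alignment; the class name and 
constants are invented for illustration:

{code:java}
/** Hypothetical sketch: 32-bit "compressed" file pointers over 16-byte-aligned blocks. */
public final class CompressedFilePointerSketch
{
    private static final int SHIFT = 4;                              // 16-byte alignment
    private static final long MAX_OFFSET = (1L << (32 + SHIFT)) - 1; // just under 64 GB

    /** Pack an aligned 64-bit file offset into 32 bits. */
    public static int compress(long fileOffset)
    {
        if (fileOffset < 0 || fileOffset > MAX_OFFSET || (fileOffset & ((1 << SHIFT) - 1)) != 0)
            throw new IllegalArgumentException("offset not representable: " + fileOffset);
        return (int) (fileOffset >>> SHIFT);
    }

    /** Recover the original 64-bit offset. */
    public static long decompress(int compressed)
    {
        return (compressed & 0xFFFFFFFFL) << SHIFT;
    }

    public static void main(String[] args)
    {
        long offset = 48L * 1024 * 1024 * 1024; // 48 GB, beyond a plain 32-bit pointer
        int packed = compress(offset);
        System.out.println(offset + " -> " + packed + " -> " + decompress(packed));
    }
}
{code}

With a 4-bit shift there are 2^32 aligned slots, i.e. 64 GB of addressable file, 
which is where the figure above comes from.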



[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090655#comment-14090655
 ] 

Benedict commented on CASSANDRA-7447:
-

bq. CASSANDRA-7443 is good to clean up the code base and should be done first

This, and further related patches, are necessary prerequisites for this ticket, 
yes.

bq. Using 32bit instead of 64bit pointers could also save some space.

I would prefer not to go down this route just yet, as it is error-prone to be 
optimising this in the first version. Any optimisations that can be made 
universally (i.e. guaranteed to be safe for all file sizes) I'm on board with, 
but obfuscating code dependent on file size I'm not - especially as this 
introduces an extra condition to execute on every single field access, 
potentially stalling the processor pipeline more readily.

bq. Trie + byte ordered types: would this mean to do some special 
serialization e.g. for timeuuid to make them binary comparable?

Yes
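
To make that concrete (purely illustrative; not necessarily the encoding that 
will ship here): "binary comparable" for timeuuid means rewriting the value so 
that the 60-bit timestamp leads and unsigned lexicographic byte comparison 
matches timestamp order. A hypothetical sketch:

{code:java}
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.UUID;

/**
 * Illustrative sketch only: rewrite a version-1 (time-based) UUID so that
 * unsigned lexicographic comparison of the output agrees with comparing the
 * embedded 60-bit timestamps.
 */
public final class TimeUUIDByteComparableSketch
{
    public static byte[] toByteComparable(UUID timeUuid)
    {
        // UUID.timestamp() reassembles time_hi | time_mid | time_low into a single
        // 60-bit value (and rejects non-version-1 UUIDs); writing it big-endian
        // first makes byte order agree with timestamp order.
        long timestamp = timeUuid.timestamp();
        ByteBuffer out = ByteBuffer.allocate(16);
        out.putLong(timestamp);
        // The remaining bits (clock sequence + node) only break ties between
        // UUIDs generated in the same 100ns interval.
        out.putLong(timeUuid.getLeastSignificantBits());
        return out.array();
    }

    public static void main(String[] args)
    {
        // Two hand-built v1 UUIDs whose timestamps increase compare correctly as bytes.
        UUID earlier = new UUID(0x0000000000001000L, 0x8000000000000001L);
        UUID later   = new UUID(0x0000000100001000L, 0x8000000000000001L);
        System.out.println(Arrays.compareUnsigned(toByteComparable(earlier), toByteComparable(later)) < 0); // true
    }
}
{code}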

bq. If one partition only contains one row, plain row-oriented storage seems 
to be more efficient. Is this what small partition layout is meant for?

No, it is because it requires fewer disk accesses to have it all packed into 
the same block (or we can have smaller blocks, increasing IOPS, especially on 
SSDs). In fact it is quite reasonable to assume that even with single-row 
partitions the column-oriented storage will be more efficient, as the columns 
do not care about partitions; they extend across all partitions, and so the 
serialization costs are reduced even if there are no clustering columns. 

I should note that the presentation at ngcc is only for historical reference 
and to get familiar with the general discussion. As mentioned in the 
description of this ticket, I now favour a row-oriented approach backed by the 
new index structures for many of the non-optimal column-oriented use cases, 
which *may* reduce the necessity of a compact column-oriented form, although it 
would still be useful as just described.

bq. Column names (CQL): I'd prefer to extend the table definition schema with 
an integer column id and use that. Could save lots of String.hashCode/equals() 
calls - even if the column-id is also used in native protocol. (Think this was 
discussed elsewhere)

There is a separate ticket for this, and I consider it to be an orthogonal 
effort. We can more easily deliver it here than we can cross-cluster 
(personally I favour cross-cluster names to be supported by a general enum type 
(CASSANDRA-6917))

bq. Bikeshed: Is the term sstable still correct?

The original sstable only imposed a sort order on the partition keys. That will 
still be imposed, so yes, but I don't have any strong attachment to the term.

bq. I didn't catch the point why only maps and sets don't naturally fit into 
columnar format but lists, strings and blobs do. Or is it just because of their 
mean serialized size?

They don't logically fit because they are an extra dimension, much as static 
columns are one _fewer_ dimension. Columnar layouts really need fixed 
dimensionality. You can flatten maps, sets and lists (my list was not 
exhaustive), but this incurs significant cost and complexity when reading them 
across multiple sstables, as opposed to relying on the standard machinery. 
Strings and blobs can more trivially be split out into an extra file if they 
are too large (for simplicity of first delivery we can just append all values 
larger than some limit to a file, and replace them with their location in the 
file), but storing large strings in a columnar layout is generally not 
sensible/beneficial anyway.
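
As a purely hypothetical illustration of that "append oversized values to a side 
file and store their location" simplification (the threshold, file layout and 
names below are invented and not part of the proposal):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: spill oversized values to a side file, keep an (offset, length) reference. */
public final class LargeValueOffloadSketch implements AutoCloseable
{
    private static final int INLINE_LIMIT = 256; // invented per-value threshold

    private final FileChannel overflow;

    public LargeValueOffloadSketch(Path overflowFile) throws IOException
    {
        this.overflow = FileChannel.open(overflowFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    /** Small values are returned untouched; large ones are appended to the side file. */
    public ByteBuffer maybeOffload(ByteBuffer value) throws IOException
    {
        if (value.remaining() <= INLINE_LIMIT)
            return value;
        long offset = overflow.size();           // append at the current end of the file
        int length = value.remaining();
        overflow.write(value.duplicate(), offset);
        ByteBuffer pointer = ByteBuffer.allocate(12); // fixed-width replacement cell
        pointer.putLong(offset).putInt(length).flip();
        return pointer;
    }

    @Override
    public void close() throws IOException
    {
        overflow.close();
    }
}
{code}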

In all likelihood I think the best approach may be to permit collections and 
statics on column-oriented tables by splitting them into a separate 
row-oriented sstable, at least in the near term. The heap-blocks outlined in 
the ngcc talk could be delivered later, although I might be inclined to tell 
users that column-oriented storage is not for them if they want to store these 
things in the table.



[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-08 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090949#comment-14090949
 ] 

T Jake Luciani commented on CASSANDRA-7447:
---

Is there any reason why you want to put the row index block next to the data? 
This actually makes it tricky to make sstables pluggable, since right now we 
would put this index in the index.db file. It could be in both places, I 
suppose, since it would help with recovery to have multiple copies.

Also, if you plan on putting the index at the front of the row, you would need 
to do some kind of two-pass write for the partition. It would be better to place 
it at the end, since for huge partitions a two-pass write would cause issues 
(we don't do two-pass compactions anymore).
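
For illustration, the single-pass, index-at-the-end approach being suggested 
could look roughly like the following sketch (the layout and names are invented, 
not this ticket's format):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: write data blocks, then the row index, then a trailer locating the index. */
public final class TrailingIndexWriterSketch
{
    public void writePartition(FileChannel out, List<ByteBuffer> dataBlocks) throws IOException
    {
        // Single pass over the data: stream blocks while remembering where each starts.
        List<Long> blockOffsets = new ArrayList<>();
        for (ByteBuffer block : dataBlocks)
        {
            blockOffsets.add(out.position());
            out.write(block);
        }

        // Append the index after the data...
        long indexOffset = out.position();
        ByteBuffer index = ByteBuffer.allocate(4 + blockOffsets.size() * 8);
        index.putInt(blockOffsets.size());
        for (long offset : blockOffsets)
            index.putLong(offset);
        index.flip();
        out.write(index);

        // ...and a fixed-width trailer so readers can seek straight to the index.
        ByteBuffer trailer = ByteBuffer.allocate(8);
        trailer.putLong(indexOffset).flip();
        out.write(trailer);
    }
}
{code}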


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091084#comment-14091084
 ] 

Benedict commented on CASSANDRA-7447:
-

bq. Is there any reason why you want to put the row index block next to the 
data? 

If we are going out of cache, we may as well read the index + data, rather than 
the index _then_ the data. With HDDs this should avoid any penalty. Bear in mind 
the index also _is_ the data in this brave new world. My goal with the new 
format is to, as far as possible, guarantee as many or fewer seeks than the old 
format (even as SSDs become more prevalent), whilst reducing the total amount of 
space necessary (so reducing the requisite disk bandwidth and improving cache 
occupancy).

bq. Is there any reason why you want to put the row index block next to the 
data? This actually makes it tricky to make sstables pluggable since right now 
we would put this index in the index.db file. It could be in both places I 
suppose since it would help with recovery to have multiple copies.

Why does this make pluggability hard? The index is an artefact of the sstable 
type (or it should be, before we roll this out), so it shouldn't matter?

bq. Also if you plan of putting the index at the front of the row you would 
need to do some kind of two pass to write the partition. 

Maybe. I'd prefer not to get down to this level of specifics just yet; I'm 
pretty sure it's solvable either way. It would be preferable to focus mostly on 
the overall design, feature set, etc. for the moment. The format is likely to be 
agnostic to where the two records live with respect to each other, but there 
are some optimisations possible on read if they're adjacent, assuming the 
records are all smaller than a page. If they are much larger than that, no 
optimisation is likely to help so it doesn't matter too much, and if they are 
smaller we only have to buffer two pages.


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-08 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091194#comment-14091194
 ] 

T Jake Luciani commented on CASSANDRA-7447:
---

bq. Why does this make pluggability hard? The index is an artifact of the 
sstable type (or it should be, before we roll this out), so it shouldn't matter?

You are right, but it makes things simpler since you already have all the code 
in place to put the index next to the partition location. In the end the only 
thing required will be the file offset to the start of the partition's data 
(which can be your index).


[jira] [Commented] (CASSANDRA-7447) New sstable format with support for columnar layout

2014-08-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091220#comment-14091220
 ] 

Benedict commented on CASSANDRA-7447:
-

bq. you already have all the code in place to put the index next to the 
partition location

I'm not sure I follow, but one of the goals is to permit faster _in memory_ 
(cached) performance. That means being more targeted about where we hit data 
inside our pages, so that we can cache at a finer granularity (and so get a 
higher cache hit rate); we don't want to scan entire pages if we can avoid it.

We have one index right now and one data file, and we persist clustering keys 
in both. This new scheme has one partition index, one data file, and one hybrid 
dataset, which can live by itself or in the data file, but behaves as both a 
clustering index and the data itself. So when we're talking about an index, 
things can get confusing. However, we want to be able to support (and improve 
upon) the current ability to seek directly within partitions, and we want to do 
so efficiently, without extra disk seeks, so we ideally want these clustering 
key records to be cached independently of the rest of the data, since they 
will/may be referenced more frequently.
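
As a rough illustration only (these class and field names are invented, not Cassandra's), the read path described above might look like the following sketch: the partition index locates the partition in the data file, and a separately cached clustering structure locates the row within the partition, so a point read never has to scan whole pages.

{code:java}
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical two-level lookup; the clustering structure is cached on its own,
// independently of the data pages, as argued in the comment above.
final class ReadPathSketch
{
    // Partition index: partition token -> offset of the partition in the data file.
    private final NavigableMap<Long, Long> partitionIndex = new TreeMap<>();

    // Cached per-partition clustering structure: clustering key -> offset of the
    // row *within* the partition.
    private final NavigableMap<Long, NavigableMap<String, Long>> clusteringCache = new TreeMap<>();

    // Absolute data-file offset of the requested row, or null if not present.
    Long rowOffset(long partitionToken, String clusteringKey)
    {
        Long partitionOffset = partitionIndex.get(partitionToken);
        if (partitionOffset == null)
            return null;                          // partition not in this sstable

        NavigableMap<String, Long> clustering = clusteringCache.get(partitionToken);
        Long relative = clustering == null ? null : clustering.get(clusteringKey);
        return relative == null ? null : partitionOffset + relative;
    }

    public static void main(String[] args)
    {
        ReadPathSketch sstable = new ReadPathSketch();
        sstable.partitionIndex.put(42L, 8_192L);
        NavigableMap<String, Long> rows = new TreeMap<>();
        rows.put("2014-08-08", 128L);             // row starts 128 bytes into the partition
        sstable.clusteringCache.put(42L, rows);
        System.out.println(sstable.rowOffset(42L, "2014-08-08")); // prints 8320
    }
}
{code}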
