[jira] [Comment Edited] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-04-13 Thread vincent.poncet (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238992#comment-15238992
 ] 

vincent.poncet edited comment on CASSANDRA-9259 at 4/13/16 10:15 AM:
-

In data warehouse / analytics use cases, you are mostly doing full scans 
(hopefully with some predicate pushdown, both projection and filtering), but 
you are still reading a big number of rows and then doing aggregations.
I just want to say that that takes time. So by definition, an analytic query 
on an OLTP database is always "wrong", in the sense that while the query is 
running, data is being changed: deleted, updated, inserted. So operational 
analytics is always approximate.
The only way to get an exact result for an analytic query on an OLTP database 
would be the capability to query the data as of a snapshot, so that the query 
sees coherent data and is not affected by changes made while it is running. 
That is like MVCC in an RDBMS: it has a performance cost and is relaxed for 
most analytics workloads.

So, my point is that in operational analytics, CL=1 will be perfectly fine.
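To make the full-scan pattern concrete, here is a rough Python sketch (illustrative only, not the Spark connector's actual code) of splitting the Murmur3 token ring into contiguous ranges, each of which maps to one "Token(partitionKey) > X AND Token(partitionKey) <= Y" scan like the query quoted in the ticket description:

```python
# Sketch: split the full Murmur3 token space into N contiguous (start, end]
# ranges for parallel bulk reads. Murmur3Partitioner tokens span
# [-2^63, 2^63 - 1]; each range becomes one token-range scan query.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def split_token_ranges(n_splits):
    """Return n_splits contiguous (start, end] ranges covering the ring."""
    total = MAX_TOKEN - MIN_TOKEN
    step = total // n_splits
    ranges = []
    start = MIN_TOKEN
    for i in range(n_splits):
        # Force the last range to end exactly at MAX_TOKEN so integer
        # division rounding never leaves a gap at the top of the ring.
        end = MAX_TOKEN if i == n_splits - 1 else start + step
        ranges.append((start, end))
        start = end
    return ranges

ranges = split_token_ranges(4)
for start, end in ranges:
    # Each range corresponds to a scan such as:
    # SELECT a, b, c FROM myKs.myTable
    #   WHERE Token(partitionKey) > {start} AND Token(partitionKey) <= {end}
    pass
```

In a real deployment the splits would follow the cluster's actual token ownership (e.g. a node's primary range without vnodes, as the description suggests) rather than an even arithmetic division.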


was (Author: vincent.pon...@gmail.com):
In datawarehouse / analytics usecases, you are doing mostly full scans, 
(hopefully with some predicate pushdown either projection and filtering), but 
it is reading a big numbers of rows, then doing aggregations.
I just want to say that that takes time. So by definition, an analytic query on 
an OLTP database is always "wrong", in the sense of at the same time of doing 
the query, data changed, deleted, updated, inserted. So operational analytics 
is always approximate.

So, my point is in operational analytics, CL=1 will be perfectly fine.

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: bulk-read-benchmark.1.html, 
> bulk-read-jfr-profiles.1.tar.gz, bulk-read-jfr-profiles.2.tar.gz
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).

[jira] [Comment Edited] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-01-26 Thread Vassil Lunchev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117397#comment-15117397
 ] 

Vassil Lunchev edited comment on CASSANDRA-9259 at 1/26/16 3:41 PM:


"For full data queries it may be advantageous to have C* be able to compact all 
of the relevant sstables into a format friendlier to analytics workloads."
I would even go further and say - "Cassandra needs a new compaction strategy. 
It has DateTieredCompactionStrategy for time series data. It needs a new one, 
for example ColumnarCompactionStrategy, that is similar in concept to Parquet 
and designed for analytics workloads."

The results here: https://github.com/velvia/cassandra-gdelt
and the ideas here: https://github.com/tuplejump/FiloDB
are very compelling. FiloDB is practically doing a new columnar compaction 
layer on top of C*. And the results are quite promising - "faster than Parquet 
scan speeds" with storage needs "within 35% of Parquet".


was (Author: vas...@leanplum.com):
"For full data queries it may be advantageous to have C* be able to compact all 
of the relevant sstables into a format friendlier to analytics workloads."
I would even go further and say - "Cassandra needs a new compaction strategy. 
It has DateTieredCompactionStrategy for time series data. It needs a new one, 
for example ColumnarCompactionStrategy, that is similar in concept to Parquet 
and designed for analytics workloads."

The results here: https://github.com/velvia/cassandra-gdelt
and the ideas here: https://github.com/tuplejump/FiloDB
are very compelling. FilloDB is practically doing a new columnar compaction 
layer on top of C*. And the results are quite promising - "faster than Parquet 
scan speeds" with storage needs "within 35% of Parquet".

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Priority: Critical
> Fix For: 3.x
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also wor