[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2019-07-21 Thread Swapnil Bhisey (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Swapnil Bhisey updated CASSANDRA-9259:
--
Description: 
This ticket is following on from the 2015 NGCC. This ticket is designed to be a 
place for discussing and designing an approach to bulk reading.

The goal is to have a bulk reading path for Cassandra. That is, a path 
optimized to grab a large portion of the data for a table (potentially all of 
it). This is a core element in the Spark integration with Cassandra, and the 
speed at which Cassandra can deliver bulk data to Spark is limiting the 
performance of Spark-plus-Cassandra operations. This is especially of 
importance as Cassandra will (likely) leverage Spark for internal operations 
(for example CASSANDRA-8234).

The core CQL to consider is the following:
 SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
Token(partitionKey) <= Y

There are a few approaches that could be considered. First, we consider a new 
"Streaming Compaction" approach. The key observation here is that a bulk read 
from Cassandra is a lot like a major compaction, though instead of outputting a 
new SSTable we would output CQL rows to a stream/socket/etc. This would be 
similar to a CompactionTask, but would strip out some unnecessary things in 
there (e.g., some of the indexing, etc). Predicates and projections could also 
be encapsulated in this new "StreamingCompactionTask", for example.

Here, we choose X and Y to be contained within one token range (perhaps 
considering the primary range of a node without vnodes, for example). This 
query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
operations via Spark (or other processing frameworks - ETL, etc). There are a 
few causes (e.g., inefficient paging).

 

Another approach would be an alternate storage format. For example, we might 
employ Parquet (just as an example) to store the same data as in the primary 
Cassandra storage (aka SSTables). This is akin to Global Indexes (an alternate 
storage of the same data optimized for a particular query). Then, Cassandra can 
choose to leverage this alternate storage for particular CQL queries (e.g., 
range scans).

These are just 2 suggestions to get the conversation going.

One thing to note is that it will be useful to have this storage segregated by 
token range so that when you extract via these mechanisms you do not get 
replications-factor numbers of copies of the data. That will certainly be an 
issue for some Spark operations (e.g., counting). Thus, we will want 
per-token-range storage (even for single disks), so this will likely leverage 
CASSANDRA-6696 (though, we'll want to also consider the single disk case).

It is also worth discussing what the success criteria is here. It is unlikely 
to be as fast as EDW or HDFS performance (though, that is still a good goal), 
but being within some percentage of that performance should be set as success. 
For example, 2x as long as doing bulk operations on HDFS with similar node 
count/size/etc.

  was:
This ticket is following on from the 2015 NGCC.  This ticket is designed to be 
a place for discussing and designing an approach to bulk reading.

The goal is to have a bulk reading path for Cassandra.  That is, a path 
optimized to grab a large portion of the data for a table (potentially all of 
it).  This is a core element in the Spark integration with Cassandra, and the 
speed at which Cassandra can deliver bulk data to Spark is limiting the 
performance of Spark-plus-Cassandra operations.  This is especially of 
importance as Cassandra will (likely) leverage Spark for internal operations 
(for example CASSANDRA-8234).

The core CQL to consider is the following:
SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
Token(partitionKey) <= Y

Here, we choose X and Y to be contained within one token range (perhaps 
considering the primary range of a node without vnodes, for example).  This 
query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
operations via Spark (or other processing frameworks - ETL, etc).  There are a 
few causes (e.g., inefficient paging).

There are a few approaches that could be considered.  First, we consider a new 
"Streaming Compaction" approach.  The key observation here is that a bulk read 
from Cassandra is a lot like a major compaction, though instead of outputting a 
new SSTable we would output CQL rows to a stream/socket/etc.  This would be 
similar to a CompactionTask, but would strip out some unnecessary things in 
there (e.g., some of the indexing, etc). Predicates and projections could also 
be encapsulated in this new "StreamingCompactionTask", for example.

Another approach would be an alternate storage format.  For example, we might 
employ Parquet (just as an example) to store the same data as in the primary 
Cassandra storage (aka 

[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2017-06-22 Thread Jeremy Hanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Hanna updated CASSANDRA-9259:

Component/s: (was: Materialized Views)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 4.x
>
> Attachments: 256_vnodes.jpg, before_after.jpg, 
> bulk-read-benchmark.1.html, bulk-read-jfr-profiles.1.tar.gz, 
> bulk-read-jfr-profiles.2.tar.gz, no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2017-06-22 Thread Jeremy Hanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Hanna updated CASSANDRA-9259:

Component/s: Materialized Views

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Materialized 
> Views, Streaming and Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 4.x
>
> Attachments: 256_vnodes.jpg, before_after.jpg, 
> bulk-read-benchmark.1.html, bulk-read-jfr-profiles.1.tar.gz, 
> bulk-read-jfr-profiles.2.tar.gz, no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: no_vnodes.jpg
256_vnodes.jpg

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: 256_vnodes.jpg, before_after.jpg, 
> bulk-read-benchmark.1.html, bulk-read-jfr-profiles.1.tar.gz, 
> bulk-read-jfr-profiles.2.tar.gz, no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: (was: 256_vnodes.jpg)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: before_after.jpg, bulk-read-benchmark.1.html, 
> bulk-read-jfr-profiles.1.tar.gz, bulk-read-jfr-profiles.2.tar.gz, 
> spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: (was: no_vnodes.jpg)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: before_after.jpg, bulk-read-benchmark.1.html, 
> bulk-read-jfr-profiles.1.tar.gz, bulk-read-jfr-profiles.2.tar.gz, 
> spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: before_after.jpg

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: 256_vnodes.jpg, before_after.jpg, 
> bulk-read-benchmark.1.html, bulk-read-jfr-profiles.1.tar.gz, 
> bulk-read-jfr-profiles.2.tar.gz, no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: (was: before_after.jpg)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: 256_vnodes.jpg, before_after.jpg, 
> bulk-read-benchmark.1.html, bulk-read-jfr-profiles.1.tar.gz, 
> bulk-read-jfr-profiles.2.tar.gz, no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: no_vnodes.jpg
256_vnodes.jpg

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: 256_vnodes.jpg, before_after.jpg, 
> bulk-read-benchmark.1.html, bulk-read-jfr-profiles.1.tar.gz, 
> bulk-read-jfr-profiles.2.tar.gz, no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: (was: 256_vnodes.jpg)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: before_after.jpg, bulk-read-benchmark.1.html, 
> bulk-read-jfr-profiles.1.tar.gz, bulk-read-jfr-profiles.2.tar.gz, 
> spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: (was: no_vnodes.jpg)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: before_after.jpg, bulk-read-benchmark.1.html, 
> bulk-read-jfr-profiles.1.tar.gz, bulk-read-jfr-profiles.2.tar.gz, 
> spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: no_vnodes.jpg
before_after.jpg
256_vnodes.jpg

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: 256_vnodes.jpg, before_after.jpg, 
> bulk-read-benchmark.1.html, bulk-read-jfr-profiles.1.tar.gz, 
> bulk-read-jfr-profiles.2.tar.gz, no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: (was: no_vnodes.jpg)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: 256_vnodes.jpg, before_after.jpg, 
> bulk-read-benchmark.1.html, bulk-read-jfr-profiles.1.tar.gz, 
> bulk-read-jfr-profiles.2.tar.gz, no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: (was: before_after.jpg)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: bulk-read-benchmark.1.html, 
> bulk-read-jfr-profiles.1.tar.gz, bulk-read-jfr-profiles.2.tar.gz, 
> no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: (was: 256_vnodes.jpg)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: bulk-read-benchmark.1.html, 
> bulk-read-jfr-profiles.1.tar.gz, bulk-read-jfr-profiles.2.tar.gz, 
> no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-07-29 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: no_vnodes.jpg
before_after.jpg
256_vnodes.jpg
spark_benchmark_raw_data.zip

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: 256_vnodes.jpg, before_after.jpg, 
> bulk-read-benchmark.1.html, bulk-read-jfr-profiles.1.tar.gz, 
> bulk-read-jfr-profiles.2.tar.gz, no_vnodes.jpg, spark_benchmark_raw_data.zip
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-03-18 Thread Stefania (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefania updated CASSANDRA-9259:

Attachment: bulk-read-benchmark.1.html
bulk-read-jfr-profiles.1.tar.gz
bulk-read-jfr-profiles.2.tar.gz

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Stefania
>Priority: Critical
> Fix For: 3.x
>
> Attachments: bulk-read-benchmark.1.html, 
> bulk-read-jfr-profiles.1.tar.gz, bulk-read-jfr-profiles.2.tar.gz
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2016-01-11 Thread Ariel Weisberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-9259:
--
Assignee: (was: Ariel Weisberg)

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Priority: Critical
> Fix For: 3.x
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2015-11-16 Thread Ariel Weisberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-9259:
--
Component/s: Streaming and Messaging
 Local Write-Read Paths
 CQL
 Compaction

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Ariel Weisberg
>Priority: Critical
> Fix For: 3.x
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2015-11-16 Thread Ariel Weisberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-9259:
--
Component/s: Testing

> Bulk Reading from Cassandra
> ---
>
> Key: CASSANDRA-9259
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Compaction, CQL, Local Write-Read Paths, Streaming and 
> Messaging, Testing
>Reporter:  Brian Hess
>Assignee: Ariel Weisberg
>Priority: Critical
> Fix For: 3.x
>
>
> This ticket is following on from the 2015 NGCC.  This ticket is designed to 
> be a place for discussing and designing an approach to bulk reading.
> The goal is to have a bulk reading path for Cassandra.  That is, a path 
> optimized to grab a large portion of the data for a table (potentially all of 
> it).  This is a core element in the Spark integration with Cassandra, and the 
> speed at which Cassandra can deliver bulk data to Spark is limiting the 
> performance of Spark-plus-Cassandra operations.  This is especially of 
> importance as Cassandra will (likely) leverage Spark for internal operations 
> (for example CASSANDRA-8234).
> The core CQL to consider is the following:
> SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey) > X AND 
> Token(partitionKey) <= Y
> Here, we choose X and Y to be contained within one token range (perhaps 
> considering the primary range of a node without vnodes, for example).  This 
> query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
> operations via Spark (or other processing frameworks - ETL, etc).  There are 
> a few causes (e.g., inefficient paging).
> There are a few approaches that could be considered.  First, we consider a 
> new "Streaming Compaction" approach.  The key observation here is that a bulk 
> read from Cassandra is a lot like a major compaction, though instead of 
> outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
> This would be similar to a CompactionTask, but would strip out some 
> unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
> projections could also be encapsulated in this new "StreamingCompactionTask", 
> for example.
> Another approach would be an alternate storage format.  For example, we might 
> employ Parquet (just as an example) to store the same data as in the primary 
> Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
> alternate storage of the same data optimized for a particular query).  Then, 
> Cassandra can choose to leverage this alternate storage for particular CQL 
> queries (e.g., range scans).
> These are just 2 suggestions to get the conversation going.
> One thing to note is that it will be useful to have this storage segregated 
> by token range so that when you extract via these mechanisms you do not get 
> replications-factor numbers of copies of the data.  That will certainly be an 
> issue for some Spark operations (e.g., counting).  Thus, we will want 
> per-token-range storage (even for single disks), so this will likely leverage 
> CASSANDRA-6696 (though, we'll want to also consider the single disk case).
> It is also worth discussing what the success criteria is here.  It is 
> unlikely to be as fast as EDW or HDFS performance (though, that is still a 
> good goal), but being within some percentage of that performance should be 
> set as success.  For example, 2x as long as doing bulk operations on HDFS 
> with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2015-08-11 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-9259:
--
Priority: Critical  (was: Major)

 Bulk Reading from Cassandra
 ---

 Key: CASSANDRA-9259
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter:  Brian Hess
Assignee: Ariel Weisberg
Priority: Critical
 Fix For: 3.x


 This ticket is following on from the 2015 NGCC.  This ticket is designed to 
 be a place for discussing and designing an approach to bulk reading.
 The goal is to have a bulk reading path for Cassandra.  That is, a path 
 optimized to grab a large portion of the data for a table (potentially all of 
 it).  This is a core element in the Spark integration with Cassandra, and the 
 speed at which Cassandra can deliver bulk data to Spark is limiting the 
 performance of Spark-plus-Cassandra operations.  This is especially of 
 importance as Cassandra will (likely) leverage Spark for internal operations 
 (for example CASSANDRA-8234).
 The core CQL to consider is the following:
 SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey)  X AND 
 Token(partitionKey) = Y
 Here, we choose X and Y to be contained within one token range (perhaps 
 considering the primary range of a node without vnodes, for example).  This 
 query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
 operations via Spark (or other processing frameworks - ETL, etc).  There are 
 a few causes (e.g., inefficient paging).
 There are a few approaches that could be considered.  First, we consider a 
 new Streaming Compaction approach.  The key observation here is that a bulk 
 read from Cassandra is a lot like a major compaction, though instead of 
 outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
 This would be similar to a CompactionTask, but would strip out some 
 unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
 projections could also be encapsulated in this new StreamingCompactionTask, 
 for example.
 Another approach would be an alternate storage format.  For example, we might 
 employ Parquet (just as an example) to store the same data as in the primary 
 Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
 alternate storage of the same data optimized for a particular query).  Then, 
 Cassandra can choose to leverage this alternate storage for particular CQL 
 queries (e.g., range scans).
 These are just 2 suggestions to get the conversation going.
 One thing to note is that it will be useful to have this storage segregated 
 by token range so that when you extract via these mechanisms you do not get 
 replications-factor numbers of copies of the data.  That will certainly be an 
 issue for some Spark operations (e.g., counting).  Thus, we will want 
 per-token-range storage (even for single disks), so this will likely leverage 
 CASSANDRA-6696 (though, we'll want to also consider the single disk case).
 It is also worth discussing what the success criteria is here.  It is 
 unlikely to be as fast as EDW or HDFS performance (though, that is still a 
 good goal), but being within some percentage of that performance should be 
 set as success.  For example, 2x as long as doing bulk operations on HDFS 
 with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2015-08-11 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-9259:
--
Issue Type: New Feature  (was: Improvement)

 Bulk Reading from Cassandra
 ---

 Key: CASSANDRA-9259
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter:  Brian Hess
Assignee: Ariel Weisberg
 Fix For: 3.x


 This ticket is following on from the 2015 NGCC.  This ticket is designed to 
 be a place for discussing and designing an approach to bulk reading.
 The goal is to have a bulk reading path for Cassandra.  That is, a path 
 optimized to grab a large portion of the data for a table (potentially all of 
 it).  This is a core element in the Spark integration with Cassandra, and the 
 speed at which Cassandra can deliver bulk data to Spark is limiting the 
 performance of Spark-plus-Cassandra operations.  This is especially of 
 importance as Cassandra will (likely) leverage Spark for internal operations 
 (for example CASSANDRA-8234).
 The core CQL to consider is the following:
 SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey)  X AND 
 Token(partitionKey) = Y
 Here, we choose X and Y to be contained within one token range (perhaps 
 considering the primary range of a node without vnodes, for example).  This 
 query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
 operations via Spark (or other processing frameworks - ETL, etc).  There are 
 a few causes (e.g., inefficient paging).
 There are a few approaches that could be considered.  First, we consider a 
 new Streaming Compaction approach.  The key observation here is that a bulk 
 read from Cassandra is a lot like a major compaction, though instead of 
 outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
 This would be similar to a CompactionTask, but would strip out some 
 unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
 projections could also be encapsulated in this new StreamingCompactionTask, 
 for example.
 Another approach would be an alternate storage format.  For example, we might 
 employ Parquet (just as an example) to store the same data as in the primary 
 Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
 alternate storage of the same data optimized for a particular query).  Then, 
 Cassandra can choose to leverage this alternate storage for particular CQL 
 queries (e.g., range scans).
 These are just 2 suggestions to get the conversation going.
 One thing to note is that it will be useful to have this storage segregated 
 by token range so that when you extract via these mechanisms you do not get 
 replications-factor numbers of copies of the data.  That will certainly be an 
 issue for some Spark operations (e.g., counting).  Thus, we will want 
 per-token-range storage (even for single disks), so this will likely leverage 
 CASSANDRA-6696 (though, we'll want to also consider the single disk case).
 It is also worth discussing what the success criteria is here.  It is 
 unlikely to be as fast as EDW or HDFS performance (though, that is still a 
 good goal), but being within some percentage of that performance should be 
 set as success.  For example, 2x as long as doing bulk operations on HDFS 
 with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2015-08-11 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-9259:
--
Fix Version/s: 3.x

 Bulk Reading from Cassandra
 ---

 Key: CASSANDRA-9259
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter:  Brian Hess
Assignee: Ariel Weisberg
 Fix For: 3.x


 This ticket is following on from the 2015 NGCC.  This ticket is designed to 
 be a place for discussing and designing an approach to bulk reading.
 The goal is to have a bulk reading path for Cassandra.  That is, a path 
 optimized to grab a large portion of the data for a table (potentially all of 
 it).  This is a core element in the Spark integration with Cassandra, and the 
 speed at which Cassandra can deliver bulk data to Spark is limiting the 
 performance of Spark-plus-Cassandra operations.  This is especially of 
 importance as Cassandra will (likely) leverage Spark for internal operations 
 (for example CASSANDRA-8234).
 The core CQL to consider is the following:
 SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey)  X AND 
 Token(partitionKey) = Y
 Here, we choose X and Y to be contained within one token range (perhaps 
 considering the primary range of a node without vnodes, for example).  This 
 query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
 operations via Spark (or other processing frameworks - ETL, etc).  There are 
 a few causes (e.g., inefficient paging).
 There are a few approaches that could be considered.  First, we consider a 
 new Streaming Compaction approach.  The key observation here is that a bulk 
 read from Cassandra is a lot like a major compaction, though instead of 
 outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
 This would be similar to a CompactionTask, but would strip out some 
 unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
 projections could also be encapsulated in this new StreamingCompactionTask, 
 for example.
 Another approach would be an alternate storage format.  For example, we might 
 employ Parquet (just as an example) to store the same data as in the primary 
 Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
 alternate storage of the same data optimized for a particular query).  Then, 
 Cassandra can choose to leverage this alternate storage for particular CQL 
 queries (e.g., range scans).
 These are just 2 suggestions to get the conversation going.
 One thing to note is that it will be useful to have this storage segregated 
 by token range so that when you extract via these mechanisms you do not get 
 replications-factor numbers of copies of the data.  That will certainly be an 
 issue for some Spark operations (e.g., counting).  Thus, we will want 
 per-token-range storage (even for single disks), so this will likely leverage 
 CASSANDRA-6696 (though, we'll want to also consider the single disk case).
 It is also worth discussing what the success criteria is here.  It is 
 unlikely to be as fast as EDW or HDFS performance (though, that is still a 
 good goal), but being within some percentage of that performance should be 
 set as success.  For example, 2x as long as doing bulk operations on HDFS 
 with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-9259) Bulk Reading from Cassandra

2015-07-24 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-9259:
--
Assignee: Ariel Weisberg

 Bulk Reading from Cassandra
 ---

 Key: CASSANDRA-9259
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9259
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter:  Brian Hess
Assignee: Ariel Weisberg

 This ticket is following on from the 2015 NGCC.  This ticket is designed to 
 be a place for discussing and designing an approach to bulk reading.
 The goal is to have a bulk reading path for Cassandra.  That is, a path 
 optimized to grab a large portion of the data for a table (potentially all of 
 it).  This is a core element in the Spark integration with Cassandra, and the 
 speed at which Cassandra can deliver bulk data to Spark is limiting the 
 performance of Spark-plus-Cassandra operations.  This is especially of 
 importance as Cassandra will (likely) leverage Spark for internal operations 
 (for example CASSANDRA-8234).
 The core CQL to consider is the following:
 SELECT a, b, c FROM myKs.myTable WHERE Token(partitionKey)  X AND 
 Token(partitionKey) = Y
 Here, we choose X and Y to be contained within one token range (perhaps 
 considering the primary range of a node without vnodes, for example).  This 
 query pushes 50K-100K rows/sec, which is not very fast if we are doing bulk 
 operations via Spark (or other processing frameworks - ETL, etc).  There are 
 a few causes (e.g., inefficient paging).
 There are a few approaches that could be considered.  First, we consider a 
 new Streaming Compaction approach.  The key observation here is that a bulk 
 read from Cassandra is a lot like a major compaction, though instead of 
 outputting a new SSTable we would output CQL rows to a stream/socket/etc.  
 This would be similar to a CompactionTask, but would strip out some 
 unnecessary things in there (e.g., some of the indexing, etc). Predicates and 
 projections could also be encapsulated in this new StreamingCompactionTask, 
 for example.
 Another approach would be an alternate storage format.  For example, we might 
 employ Parquet (just as an example) to store the same data as in the primary 
 Cassandra storage (aka SSTables).  This is akin to Global Indexes (an 
 alternate storage of the same data optimized for a particular query).  Then, 
 Cassandra can choose to leverage this alternate storage for particular CQL 
 queries (e.g., range scans).
 These are just 2 suggestions to get the conversation going.
 One thing to note is that it will be useful to have this storage segregated 
 by token range so that when you extract via these mechanisms you do not get 
 replications-factor numbers of copies of the data.  That will certainly be an 
 issue for some Spark operations (e.g., counting).  Thus, we will want 
 per-token-range storage (even for single disks), so this will likely leverage 
 CASSANDRA-6696 (though, we'll want to also consider the single disk case).
 It is also worth discussing what the success criteria is here.  It is 
 unlikely to be as fast as EDW or HDFS performance (though, that is still a 
 good goal), but being within some percentage of that performance should be 
 set as success.  For example, 2x as long as doing bulk operations on HDFS 
 with similar node count/size/etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)