[jira] [Comment Edited] (CASSANDRA-11542) Create a benchmark to compare HDFS and Cassandra bulk read times

Russell Alexander Spitzer (JIRA) Fri, 29 Apr 2016 14:05:26 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264698#comment-15264698
 ]


Russell Alexander Spitzer edited comment on CASSANDRA-11542 at 4/29/16 9:04 PM:
--------------------------------------------------------------------------------

-Hmm i'm a little confused that case classes don't help but Dataframes do...-  
https://datastax-oss.atlassian.net/browse/SPARKC-373 Saw that we are always 
doing the conversion to CassandraRow with RDDs, dataframes go directly to the 
internal SQL Type. 

The code you presented looks good to me, there is the potential issue of 
blocking on resultsets that take a long time to complete while other 
result-sets are already on the driver but i'm not sure if this is a big deal. 
Do you have any idea of the parallelization in these test? How many partitions 
are the different runs generating?




was (Author: rspitzer):
Hmm i'm a little confused that case classes don't help but Dataframes do... The 
code you presented looks good to me, there is the potential issue of blocking 
on resultsets that take a long time to complete while other result-sets are 
already on the driver but i'm not sure if this is a big deal. Do you have any 
idea of the parallelization in these test? How many partitions are the 
different runs generating?

> Create a benchmark to compare HDFS and Cassandra bulk read times
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-11542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11542
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: spark-load-perf-results-001.zip, 
> spark-load-perf-results-002.zip
>
>
> I propose creating a benchmark for comparing Cassandra and HDFS bulk reading 
> performance. Simple Spark queries will be performed on data stored in HDFS or 
> Cassandra, and the entire duration will be measured. An example query would 
> be the max or min of a column or a count\(*\).
> This benchmark should allow determining the impact of:
> * partition size
> * number of clustering columns
> * number of value columns (cells)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (CASSANDRA-11542) Create a benchmark to compare HDFS and Cassandra bulk read times

Reply via email to