[
https://issues.apache.org/jira/browse/CASSANDRA-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263553#comment-15263553
]
Russell Alexander Spitzer commented on CASSANDRA-11542:
-------------------------------------------------------
You may also want to run tests reading into case classes rather than
CassandraRows:
{code}
case class RowName(col: ColType, col2: Col2Type, ...)
sc.cassandraTable[RowName]
{code}
This may explain some of the difference between RDD and DataFrame read times:
DataFrames read into a different format than RDDs by default (SqlRows vs
CassandraRows), and case classes should be much more efficient than the
map-based CassandraRows. In addition, I think the Parquet versions are able to
skip full scans because of their column metadata, which may give them an
advantage over CSV, but I'm not really sure about that; it could just be the
compression of repeated values.
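As a toy illustration (plain Scala, not connector code, with made-up field names) of why a typed case class should beat a map-based row: field access on a case class is a direct getter, while each map lookup hashes the key and needs a cast on the boxed value:
{code}
// Hypothetical row with two columns
case class RowName(col: Int, col2: String)

val typed = RowName(1, "a")
val mapBased: Map[String, Any] = Map("col" -> 1, "col2" -> "a")

val fromCaseClass = typed.col                   // direct accessor, no boxing
val fromMap = mapBased("col").asInstanceOf[Int] // hash lookup + cast per access
{code}
Both paths yield the same value, but the map-based path pays the lookup and cast cost on every cell of every row read.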
> Create a benchmark to compare HDFS and Cassandra bulk read times
> ----------------------------------------------------------------
>
> Key: CASSANDRA-11542
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11542
> Project: Cassandra
> Issue Type: Sub-task
> Components: Testing
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 3.x
>
> Attachments: spark-load-perf-results-001.zip,
> spark-load-perf-results-002.zip
>
>
> I propose creating a benchmark for comparing Cassandra and HDFS bulk reading
> performance. Simple Spark queries will be performed on data stored in HDFS or
> Cassandra, and the entire duration will be measured. An example query would
> be the max or min of a column or a count(*).
> This benchmark should allow determining the impact of:
> * partition size
> * number of clustering columns
> * number of value columns (cells)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)