[ 
https://issues.apache.org/jira/browse/CASSANDRA-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263830#comment-15263830
 ] 

Stefania commented on CASSANDRA-11542:
--------------------------------------

Thank you for the suggestions [~rspitzer]. Switching to case classes for the 
RDD tests has actually made things much worse for streaming. 

For example, for schema 1:

||Test||Time||Std. Dev||
|         parquet_rdd|    2.67|    0.50|
|          parquet_df|    2.69|    0.66|
|             csv_rdd|    5.20|    0.58|
|              csv_df|   12.37|    0.53|
|       cassandra_rdd|   49.03|    5.46|
|cassandra_rdd_stream|   47.49|    3.84|
|        cassandra_df|   27.78|    0.74|
| cassandra_df_stream|   19.35|    1.51|

I've also fixed a bug in the benchmark: previously, sstables were only 
compacted and flushed on the Spark master, not on all nodes. In addition, the 
OS page cache may not have been flushed on all nodes. Here are the schema 1 
results repeated with these problems addressed:

||Test||Time||Std. Dev||
|         parquet_rdd|    4.81|    0.31|
|          parquet_df|    4.77|    0.78|
|             csv_rdd|    7.59|    0.22|
|              csv_df|   13.09|    0.41|
|       cassandra_rdd|   45.13|    0.55|
|cassandra_rdd_stream|   41.64|    0.37|
|        cassandra_df|   36.15|   11.70|
| cassandra_df_stream|   22.55|    3.36|

In terms of IO usage, dstat shows less than 10 MB/s for Cassandra compared 
to 110 MB/s for HDFS.

Regarding the degradation of RDD streaming with case classes, I suspect the way 
I implemented the RDD iterator might be inefficient. Could you take a look 
[here|https://github.com/datastax/spark-cassandra-connector/compare/master...stef1927:9259#diff-fecec01aca0cc6ed91526423b292eceaR12]?
We receive an iterator of futures from the driver, one per page. As soon as 
a future completes, we return its rows, which are then converted to case 
classes. I think we should somehow convert the rows of each page in parallel 
to speed things up and take advantage of streaming.
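
To make the idea concrete, here is a rough, language-neutral sketch in Python 
rather than the connector's Scala. It is not the code in the linked branch. 
Reading, to_reading, sequential_rows and parallel_rows are hypothetical names 
invented for illustration, and a frozen dataclass stands in for a case class. 
The baseline converts each page on the consumer thread once its future 
completes. The alternative hands each page to a worker pool so that conversion 
of later pages can overlap with consumption of earlier ones, while rows are 
still yielded in page order.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

RawRow = Tuple[int, float]  # assumed shape of a raw driver row


@dataclass(frozen=True)
class Reading:
    """Hypothetical stand-in for a Scala case class."""
    sensor_id: int
    value: float


def to_reading(row: RawRow) -> Reading:
    """Convert one raw row into the typed record."""
    return Reading(sensor_id=row[0], value=row[1])


def sequential_rows(pages: Iterable["Future[List[RawRow]]"]) -> Iterator[Reading]:
    """Baseline: wait for each page, then convert its rows on the consumer thread."""
    for page in pages:
        for row in page.result():
            yield to_reading(row)


def parallel_rows(
    pages: Iterable["Future[List[RawRow]]"],
    convert: Callable[[RawRow], Reading],
    pool: ThreadPoolExecutor,
) -> Iterator[Reading]:
    """Convert every page on the pool, then yield the results in page order."""

    def convert_page(page: "Future[List[RawRow]]") -> List[Reading]:
        # Runs on a pool worker: block until the page arrives, then convert it.
        return [convert(row) for row in page.result()]

    # Submitting all pages up front lets conversions overlap with each other
    # and with the consumer. Note that this drains the page iterator eagerly.
    converted = [pool.submit(convert_page, page) for page in pages]
    for done in converted:
        yield from done.result()
```

Two caveats apply. Draining the page iterator eagerly keeps every converted 
page in memory, so a real implementation would probably bound the number of 
pages in flight. Also, CPython threads do not run CPU-bound conversion in 
parallel, so this only shows the structure; on the JVM the same layout would 
use real parallel threads.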

[~slebresne]: noted, thank you! I should be able to start CASSANDRA-11521 next 
week and continue this investigation in parallel. 


> Create a benchmark to compare HDFS and Cassandra bulk read times
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-11542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11542
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: spark-load-perf-results-001.zip, 
> spark-load-perf-results-002.zip
>
>
> I propose creating a benchmark for comparing Cassandra and HDFS bulk reading 
> performance. Simple Spark queries will be performed on data stored in HDFS or 
> Cassandra, and the entire duration will be measured. An example query would 
> be the max or min of a column or a count(*).
> This benchmark should allow determining the impact of:
> * partition size
> * number of clustering columns
> * number of value columns (cells)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
