[ https://issues.apache.org/jira/browse/CASSANDRA-11542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263830#comment-15263830 ]

Stefania commented on CASSANDRA-11542:
--------------------------------------

Thank you for the suggestions [~rspitzer].

Switching to case classes for the RDD tests has actually made things much worse for streaming. For example, for schema 1:

||Test||Time||Std. Dev||
|parquet_rdd|2.67|0.50|
|parquet_df|2.69|0.66|
|csv_rdd|5.20|0.58|
|csv_df|12.37|0.53|
|cassandra_rdd|49.03|5.46|
|cassandra_rdd_stream|47.49|3.84|
|cassandra_df|27.78|0.74|
|cassandra_df_stream|19.35|1.51|

I've also fixed a bug in the benchmark: previously, sstables were only compacted and flushed on the Spark master, not on all nodes. In addition, the OS page cache may not have been flushed on all nodes. Here is a repeat of the data for schema 1 with this problem addressed:

||Test||Time||Std. Dev||
|parquet_rdd|4.81|0.31|
|parquet_df|4.77|0.78|
|csv_rdd|7.59|0.22|
|csv_df|13.09|0.41|
|cassandra_rdd|45.13|0.55|
|cassandra_rdd_stream|41.64|0.37|
|cassandra_df|36.15|11.70|
|cassandra_df_stream|22.55|3.36|

In terms of IO usage, dstat shows less than 10 MB/s for Cassandra compared to 110 MB/s for HDFS.

Regarding the degradation of RDD streaming with case classes, I suspect the way I implemented the RDD iterator is inefficient. Could you take a look [here|https://github.com/datastax/spark-cassandra-connector/compare/master...stef1927:9259#diff-fecec01aca0cc6ed91526423b292eceaR12]? We receive an iterator of futures from the driver, one per page. As soon as a future completes we return its rows, which then get converted to case classes. I think we should convert page rows in parallel to speed things up and take advantage of streaming.

[~slebresne]: noted, thank you! I should be able to start CASSANDRA-11521 next week and continue this investigation in parallel.

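
To illustrate the parallel page conversion suggested above, here is a minimal Python sketch (not the connector's Scala code). It assumes the driver hands back one future per page and that each page is a list of rows. The names {{convert_pages_in_parallel}}, {{convert}} and {{workers}} are hypothetical. It only shows the structure: conversion tasks are submitted up front so that later pages can be converted while the consumer is still reading earlier ones. In CPython the GIL limits the speed-up for CPU-bound conversion, so the actual gain would have to be measured in the real connector.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Iterable, Iterator


def convert_pages_in_parallel(
    page_futures: Iterable[Future],
    convert: Callable[[Any], Any],
    workers: int = 4,
) -> Iterator[Any]:
    """Yield converted rows in page order, converting pages on a worker pool.

    Assumption: each future resolves to a list of rows for one page.
    """

    def convert_page(page_future: Future) -> list:
        # Blocks a pool worker (not the consumer) until the page has arrived,
        # then converts every row of that page.
        return [convert(row) for row in page_future.result()]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One conversion task per page. Note that this consumes the whole
        # iterator of page futures eagerly; a real implementation would
        # probably bound the number of pages in flight.
        tasks = [pool.submit(convert_page, f) for f in page_futures]
        for task in tasks:
            # Results are read in submission order, so row order is preserved
            # even if a later page finishes converting first.
            yield from task.result()
```

A caller would iterate over the result exactly as over the original row iterator, for example {{for row in convert_pages_in_parallel(pages, to_case_class): ...}}, where {{to_case_class}} stands in for the row conversion.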
> Create a benchmark to compare HDFS and Cassandra bulk read times
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-11542
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11542
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: spark-load-perf-results-001.zip, spark-load-perf-results-002.zip
>
>
> I propose creating a benchmark for comparing Cassandra and HDFS bulk reading performance. Simple Spark queries will be performed on data stored in HDFS or Cassandra, and the entire duration will be measured. An example query would be the max or min of a column or a count(*).
> This benchmark should allow determining the impact of:
> * partition size
> * number of clustering columns
> * number of value columns (cells)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)