Tin Vu created SPARK-23797:
------------------------------

             Summary: SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto
                 Key: SPARK-23797
                 URL: https://issues.apache.org/jira/browse/SPARK-23797
             Project: Spark
          Issue Type: Bug
          Components: Optimizer, Spark Submit, SQL
    Affects Versions: 2.3.0
            Reporter: Tin Vu

I am running a benchmark to compare the performance of SparkSQL, Apache Drill, and Presto. My experimental setup:

* TPCDS dataset with scale factor 100 (size 100GB).
* Spark, Drill, and Presto have the same number of workers: 12.
* Each worker has the same amount of allocated memory: 4GB.
* Data is stored by Hive in ORC format.

I executed a very simple SQL query: "SELECT * from table_name"

The issue is that for some small tables (even tables with only a few dozen records), SparkSQL still required about 7-8 seconds to finish, while Drill and Presto needed less than 1 second. For other, large tables with billions of records, SparkSQL performance was reasonable: it required 20-30 seconds to scan the whole table.

Do you have any idea or a reasonable explanation for this issue?

Thanks,
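
For anyone trying to reproduce the measurement described above, a small timing harness along the following lines may help. This is only a minimal sketch, not the benchmark code used for this report. The names `time_query`, `benchmark_tables`, and `run_query` are hypothetical; `run_query` stands for any callable that executes a SQL string on the engine under test and fully fetches the result, and how it connects to Spark, Drill, or Presto is left as an assumption.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def time_query(run_query: Callable[[str], object], sql: str, repeats: int = 3) -> List[float]:
    """Run `sql` `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Assumption: run_query blocks until every row has been fetched,
        # e.g. in PySpark something like `lambda q: spark.sql(q).collect()`.
        run_query(sql)
        timings.append(time.perf_counter() - start)
    return timings


def benchmark_tables(
    run_query: Callable[[str], object],
    tables: Iterable[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the median seconds of "SELECT * FROM <table>" for each table."""
    results = {}
    for table in tables:
        timings = time_query(run_query, f"SELECT * FROM {table}", repeats)
        results[table] = statistics.median(timings)
    return results
```

Calling `benchmark_tables(run_query, ["small_table", "large_table"])` (placeholder table names) returns one median per table. Looking at the individual values returned by `time_query` as well can show whether the first run is slower than later ones, which would suggest a fixed startup cost rather than scan time.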