Andrew Or created SPARK-16196:
---------------------------------

             Summary: Optimize in-memory scan performance using ColumnarBatches
                 Key: SPARK-16196
                 URL: https://issues.apache.org/jira/browse/SPARK-16196
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Andrew Or
            Assignee: Andrew Or
A simple benchmark such as the following reveals inefficiencies in the existing in-memory scan implementation:

{code}
spark.range(N)
  .selectExpr("id", "floor(rand() * 10000) as k")
  .createOrReplaceTempView("test")
val ds = spark.sql("select count(k), count(id) from test").cache()
ds.collect()
ds.collect()
{code}

Several factors make caching slow. The biggest is compression, which takes a long time to build the cached columns. The second is the large number of virtual function calls on this hot code path, since the rows are processed through iterators. Finally, the rows are converted to and from ByteBuffers, which are generally slow to read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
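To make the iterator-overhead point concrete, here is a minimal Java sketch (plain JVM code, not Spark's actual classes; the names are hypothetical) contrasting row-at-a-time access through a virtual `next()` call with a columnar-batch-style tight loop over a primitive array:

```java
import java.util.Iterator;
import java.util.stream.LongStream;

public class ColumnarSketch {
    // Row-at-a-time: every value costs a virtual hasNext()/next() dispatch
    // plus Long boxing -- the per-row overhead the issue describes.
    static long sumRows(Iterator<Long> rows) {
        long total = 0;
        while (rows.hasNext()) total += rows.next();
        return total;
    }

    // Columnar-batch style: one primitive array per column, scanned in a
    // tight loop with no per-row dispatch, which the JIT optimizes well.
    static long sumColumn(long[] col) {
        long total = 0;
        for (long v : col) total += v;
        return total;
    }

    public static void main(String[] args) {
        long[] col = LongStream.range(0, 1_000_000).toArray();
        Iterator<Long> rows = LongStream.range(0, 1_000_000).boxed().iterator();
        // Both paths compute the same aggregate; only the access pattern differs.
        System.out.println(sumRows(rows) == sumColumn(col)); // true
        System.out.println(sumColumn(col)); // 499999500000
    }
}
```

The aggregate is identical either way; the proposal is about replacing the dispatch-heavy row path with the array-style path during in-memory scans.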