[ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-16196:
----------------------------------
Target Version/s: (was: 2.5.0)
> Optimize in-memory scan performance using ColumnarBatches
> ---------------------------------------------------------
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Andrew Or
> Assignee: Andrew Or
> Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the
> existing in-memory scan implementation:
> {code}
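> // N is the number of rows to generate; the report does not give a value.
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 10000) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()  // first action runs the query and populates the cache
> ds.collect()  // second action reads from the in-memory cache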
> {code}
> Caching is slow for several reasons. The largest cost is compression,
> which takes a long time. The second is the large number of virtual function
> calls on this hot code path, because rows are processed through iterators.
> In addition, rows are converted to and from ByteBuffers, which are
> generally slow to read.
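
To make the iterator-versus-columnar point above concrete, here is a minimal Scala sketch of the two scan styles for the `count(k), count(id)` query. It is not Spark's implementation: `ScanSketch`, `Row`, `LongColumn`, and `Batch` are hypothetical names invented for illustration, and the sketch ignores compression and ByteBuffer conversion entirely.

```scala
// Hypothetical illustration only; none of these types are Spark classes.
object ScanSketch {
  // Row-at-a-time: every row is an object with boxed fields, pulled through
  // an Iterator, so each element costs hasNext/next calls on the hot path.
  final case class Row(id: java.lang.Long, k: java.lang.Long)

  // Returns (count(k), count(id)), counting only non-null values.
  def countRows(rows: Iterator[Row]): (Long, Long) = {
    var countK = 0L
    var countId = 0L
    while (rows.hasNext) {
      val r = rows.next()
      if (r.k != null) countK += 1
      if (r.id != null) countId += 1
    }
    (countK, countId)
  }

  // Column-at-a-time: each column is a primitive array plus a null mask,
  // scanned in a tight indexed loop with no per-row iterator calls.
  final class LongColumn(val values: Array[Long], val isNull: Array[Boolean]) {
    def countNonNull: Long = {
      var n = 0L
      var i = 0
      while (i < values.length) {
        if (!isNull(i)) n += 1
        i += 1
      }
      n
    }
  }

  final case class Batch(id: LongColumn, k: LongColumn)

  // Returns (count(k), count(id)) for one batch of columns.
  def countBatch(b: Batch): (Long, Long) =
    (b.k.countNonNull, b.id.countNonNull)
}
```

Both functions compute the same pair of counts; the difference is only in how the data is laid out and traversed, which is the aspect the issue title refers to with ColumnarBatches.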