[jira] [Updated] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

Dongjoon Hyun (JIRA) Thu, 06 Dec 2018 22:58:09 -0800


     [ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dongjoon Hyun updated SPARK-16196:
----------------------------------
    Target Version/s:   (was: 2.5.0)

> Optimize in-memory scan performance using ColumnarBatches
> ---------------------------------------------------------
>
>                 Key: SPARK-16196
>                 URL: https://issues.apache.org/jira/browse/SPARK-16196
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
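> {code}
> // N is the number of rows to generate; its value is not given here.
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 10000) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()  // first action: runs the query and populates the cache
> ds.collect()  // second action: served from the in-memory cache
> {code}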
> Caching is slow for several reasons. The biggest is that compression 
> takes a long time. The second is that this hot code path makes many 
> virtual function calls, because the rows are processed using iterators. 
> Finally, the rows are converted to and from ByteBuffers, which are 
> generally slow to read.
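
The iterator overhead described above can be illustrated with a small, self-contained sketch. It does not use Spark or its ColumnarBatch API: `Row`, `LongColumn`, `Batch`, and the two count functions are hypothetical names introduced only for this example, and the layout is an assumption about what a columnar batch roughly looks like (one primitive array per column plus a null mask). It contrasts a row-at-a-time scan, which pays two iterator calls per row, with a batch-at-a-time scan that runs a tight loop over arrays.

```scala
// Hypothetical illustration only; none of these types are Spark classes.
object ScanSketch {
  // Row-oriented layout: one object per row, fields boxed so they can be null.
  final case class Row(id: java.lang.Long, k: java.lang.Long)

  // Column-oriented layout: one primitive array per column plus a null mask.
  final case class LongColumn(values: Array[Long], isNull: Array[Boolean])
  final case class Batch(numRows: Int, id: LongColumn, k: LongColumn)

  // Row-at-a-time scan: hasNext and next are interface calls made once per row.
  // Returns (count of non-null k, count of non-null id).
  def countRowWise(rows: Iterator[Row]): (Long, Long) = {
    var countK = 0L
    var countId = 0L
    while (rows.hasNext) {
      val r = rows.next()
      if (r.k != null) countK += 1
      if (r.id != null) countId += 1
    }
    (countK, countId)
  }

  // Counts non-null entries of one column with a plain indexed loop.
  def countNonNull(col: LongColumn, numRows: Int): Long = {
    var n = 0L
    var i = 0
    while (i < numRows) {
      if (!col.isNull(i)) n += 1
      i += 1
    }
    n
  }

  // Batch-at-a-time scan: one tight loop per column, no per-row iterator calls.
  // Returns (count of non-null k, count of non-null id).
  def countColumnar(b: Batch): (Long, Long) =
    (countNonNull(b.k, b.numRows), countNonNull(b.id, b.numRows))

  def main(args: Array[String]): Unit = {
    val n = 1000
    val rnd = new scala.util.Random(42)
    val ids = Array.tabulate(n)(_.toLong)
    val ks = Array.fill(n)(rnd.nextInt(10000).toLong)
    val noNulls = new Array[Boolean](n) // all false: no null values

    val batch = Batch(n, LongColumn(ids, noNulls), LongColumn(ks, noNulls))
    val rows = Iterator.tabulate(n)(i => Row(ids(i), ks(i)))

    // Both scans must agree on the counts.
    assert(countRowWise(rows) == countColumnar(batch))
    println(countColumnar(batch))
  }
}
```

Both functions compute the same counts; the difference is only in how the data is reached. This sketch says nothing about the compression or ByteBuffer costs mentioned in the description, and it is not a benchmark.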




