GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19603
[SPARK-22385][SQL] MapObjects should not access list element by index
## What changes were proposed in this pull request?
This issue was discovered and investigated by Ohad Raviv and Sean Owen in
https://issues.apache.org/jira/browse/SPARK-21657. The input data of
`MapObjects` may be a `List`, which has O(n) complexity for access by index.
When converting the input data to a catalyst array, `MapObjects` fetches the
element by index in every loop iteration, which is O(n^2) overall and results
in bad performance.
This PR fixes the issue by accessing elements through an `Iterator` instead.
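For context, here is a minimal standalone Scala sketch (not the actual `MapObjects` generated code; the object name and timing helper are made up for illustration) of why per-index access on a `scala.List` is quadratic overall while a single iterator pass is linear:
```
// Standalone illustration only: scala.List is a linked list, so list(i) walks
// i nodes from the head; calling it for every index in a loop is O(n^2).
object ListAccessSketch {
  def main(args: Array[String]): Unit = {
    val n = 100000
    val list = (0 until n).toList

    // Tiny timing helper (illustrative, not a rigorous benchmark).
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
      result
    }

    // O(n^2): each list(i) traverses the list from the head.
    time("indexed access") {
      var sum = 0L
      var i = 0
      while (i < n) { sum += list(i); i += 1 }
      sum
    }

    // O(n): a single pass over the list via its iterator.
    time("iterator access") {
      var sum = 0L
      val it = list.iterator
      while (it.hasNext) sum += it.next()
      sum
    }
  }
}
```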
## How was this patch tested?
Using the test script from https://issues.apache.org/jira/browse/SPARK-21657:
```
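// Run in spark-shell, so that sc, spark and the toDF implicit are in scope.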
val BASE = 100000000
val N = 100000
val df = sc.parallelize(List(("1234567890", (BASE to (BASE+N)).map(x =>
(x.toString, (x+1).toString, (x+2).toString, (x+3).toString)).toList
))).toDF("c1", "c_arr")
spark.time(df.queryExecution.toRdd.foreach(_ => ()))
```
We can see roughly a 50x speedup.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark map-objects
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19603.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19603
----
commit 6cb4fca89e83172407114037d3a447ae6d941f0a
Author: Wenchen Fan <[email protected]>
Date: 2017-10-29T11:27:21Z
MapObjects should not access list element by index
----
---