GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19603

    [SPARK-22385][SQL] MapObjects should not access list element by index

    ## What changes were proposed in this pull request?
    
    This issue was discovered and investigated by Ohad Raviv and Sean Owen in
    https://issues.apache.org/jira/browse/SPARK-21657. The input data of
    `MapObjects` may be a `List`, which has O(n) complexity for access by index.
    When converting the input data to a catalyst array, `MapObjects` fetches an
    element by index on every loop iteration, so the conversion costs O(n^2)
    overall and performs poorly on large lists.
    
    This PR fixes the issue by accessing elements through an `Iterator` instead.
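    
    To illustrate the difference between the two access patterns, here is a minimal sketch in plain Scala. It is not the code generated by `MapObjects`; the helper names `copyByIndex` and `copyByIterator` and the sample `data` are hypothetical and exist only for this example. It can be pasted into a Scala REPL or spark-shell.
    
    ```scala
    // Hypothetical helpers for illustration only; not part of Spark.

    // Index-based copy: on a List, input(i) walks i nodes from the head,
    // so the whole loop is O(n^2).
    def copyByIndex(input: Seq[Int]): Array[Int] = {
      val n = input.length
      val out = new Array[Int](n)
      var i = 0
      while (i < n) {
        out(i) = input(i)
        i += 1
      }
      out
    }

    // Iterator-based copy: each next() is O(1), so the whole loop is O(n)
    // for a List as well as for indexed sequences.
    def copyByIterator(input: Seq[Int]): Array[Int] = {
      val out = new Array[Int](input.length)
      val it = input.iterator
      var i = 0
      while (it.hasNext) {
        out(i) = it.next()
        i += 1
      }
      out
    }

    // Assumed sample input: a linked List, the slow case for index access.
    val data: List[Int] = (1 to 20000).toList
    ```
    
    Both helpers produce the same array; only the cost of walking a `List` differs. For an indexed sequence such as `Vector` or `Array`, the two approaches should perform similarly.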
    
    ## How was this patch tested?
    
    Using the test script from https://issues.apache.org/jira/browse/SPARK-21657:
    ```
    val BASE = 100000000
    val N = 100000
    val df = sc.parallelize(List(("1234567890",
      (BASE to (BASE + N)).map(x =>
        (x.toString, (x + 1).toString, (x + 2).toString, (x + 3).toString)).toList
    ))).toDF("c1", "c_arr")
    spark.time(df.queryExecution.toRdd.foreach(_ => ()))
    ```
    
    With this change, the script shows a 50x speedup.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark map-objects

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19603.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19603
    
----
commit 6cb4fca89e83172407114037d3a447ae6d941f0a
Author: Wenchen Fan <[email protected]>
Date:   2017-10-29T11:27:21Z

    MapObjects should not access list element by index

----

