GitHub user kiszk reopened a pull request:

    https://github.com/apache/spark/pull/19601

    [SPARK-22383][SQL] Generate code to directly get value of primitive type array from ColumnVector for table cache

    ## What changes were proposed in this pull request?
    
    This PR generates Java code that reads a value of a primitive-type array 
directly from a ColumnVector, without using an iterator, for the table cache 
(e.g. `dataframe.cache`). It improves runtime performance by eliminating the 
data copy from column-oriented storage to InternalRow that is performed in 
SpecificColumnarIterator for primitive types. This is a follow-up to #18747.
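
As a rough illustration of why skipping the row copy helps, here is a minimal Scala sketch. It does not use Spark's classes: `IntArrayColumn`, `CacheReadSketch`, and their methods are hypothetical names invented for this example, and the flat values-plus-offsets layout is only an assumption standing in for column-oriented storage.

```scala
// Hypothetical model of a cached array<int> column; not Spark's actual classes.
// All rows' elements live in one flat buffer; offsets(i) marks where row i starts.
final class IntArrayColumn(val values: Array[Int], val offsets: Array[Int]) {
  def numRows: Int = offsets.length - 1

  // Row-oriented path: copy row i's elements into a fresh array first.
  def copyRow(i: Int): Array[Int] =
    java.util.Arrays.copyOfRange(values, offsets(i), offsets(i + 1))

  // Direct path: read element j of row i straight from the column buffer.
  def get(i: Int, j: Int): Int = values(offsets(i) + j)
}

object CacheReadSketch {
  // Builds a column from per-row arrays.
  def fromRows(rows: Seq[Array[Int]]): IntArrayColumn = {
    val offsets = rows.scanLeft(0)(_ + _.length).toArray
    new IntArrayColumn(rows.iterator.flatMap(_.iterator).toArray, offsets)
  }

  // Counts rows whose element `idx` is divisible by 10, copying each row first.
  def countViaRowCopy(col: IntArrayColumn, idx: Int): Long =
    (0 until col.numRows).count(i => col.copyRow(i)(idx) % 10 == 0).toLong

  // Same count, without materializing any row.
  def countDirect(col: IntArrayColumn, idx: Int): Long =
    (0 until col.numRows).count(i => col.get(i, idx) % 10 == 0).toLong
}
```

Both counting methods return the same result; the direct one simply avoids allocating and filling a per-row array before reading a single element, which is the kind of work the generated code in this PR is meant to skip.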
    
    The idea of this implementation is to add `ColumnVector.UnsafeArray`, 
which holds an `UnsafeArrayData` for an array, in addition to 
`ColumnVector.Array`, which holds a `ColumnVector` backed by a Java primitive 
array.
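
The two kinds of array holder can be pictured with a small Scala sketch. Everything here is hypothetical and simplified: `IntArrayView`, `ColumnBackedView`, `BlobBackedView`, and `ArrayViewSketch` are names made up for illustration, and the binary layout (a 4-byte count followed by 4-byte little-endian ints) is an assumption, not the real `UnsafeArrayData` format.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Common read interface; a hypothetical stand-in for an array accessor.
trait IntArrayView {
  def length: Int
  def getInt(i: Int): Int
}

// Analogue of an array held as a window over a column of primitives:
// elements [offset, offset + length) of a shared Int buffer.
final class ColumnBackedView(data: Array[Int], offset: Int, val length: Int)
    extends IntArrayView {
  def getInt(i: Int): Int = data(offset + i)
}

// Analogue of an array held as one serialized binary blob.
// Simplified layout (an assumption for this sketch):
// a 4-byte element count followed by 4-byte little-endian ints.
final class BlobBackedView(bytes: Array[Byte]) extends IntArrayView {
  private val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
  val length: Int = buf.getInt(0)
  def getInt(i: Int): Int = buf.getInt(4 + 4 * i)
}

object ArrayViewSketch {
  // Serializes an Int array into the simplified blob layout above.
  def toBlob(xs: Array[Int]): Array[Byte] = {
    val buf = ByteBuffer.allocate(4 + 4 * xs.length).order(ByteOrder.LITTLE_ENDIAN)
    buf.putInt(xs.length)
    xs.foreach(x => buf.putInt(x))
    buf.array()
  }
}
```

Code that reads through the shared interface does not need to know which backing is in use, which is one plausible reason to keep both holders side by side.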
    
    Benchmark result: **21.4x** faster (best time of 1368 ms with InternalRow codegen vs. 64 ms with ColumnVector codegen)
    
    ```
    OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
    Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz

    Filter for int primitive array with cache: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    InternalRow codegen                           1368 / 1887         23.0          43.5       1.0X

    Filter for int primitive array with cache: Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    ColumnVector codegen                            64 /   90        488.1           2.0       1.0X
    ```
    
    Benchmark program
    ```
      import org.apache.spark.sql.SQLContext

      // Benchmark is Spark's internal benchmark utility.
      // ROWS is not defined in this snippet; the enclosing benchmark must provide it.
      intArrayBenchmark(sqlContext, 1024 * 1024 * 20)

      def intArrayBenchmark(sqlContext: SQLContext, values: Int, iters: Int = 20): Unit = {
        import sqlContext.implicits._
        val benchmarkPT = new Benchmark("Filter for int primitive array with cache", values, iters)
        val df = sqlContext.sparkContext.parallelize(0 to ROWS, 1)
                           .map(i => Array.range(i, values)).toDF("a").cache
        df.count  // force the cache to be materialized
        val str = "ColumnVector"
        var c: Long = 0  // accumulates the counts so each filter result is used
        benchmarkPT.addCase(s"$str codegen") { iter =>
          c += df.filter(s"a[${values/2}] % 10 = 0").count
        }
        benchmarkPT.run()
        df.unpersist(true)
        System.gc()
      }
    ```
    
    ## How was this patch tested?
    
    Added test cases to `ColumnVectorSuite`, `DataFrameTungstenSuite`, and 
`WholeStageCodegenSuite`.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kiszk/spark SPARK-22383

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19601.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19601
    
----
commit 12dd996b134cbd6aeb83d70dc14f50fc2516e6ea
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-10-29T03:28:06Z

    initial commit

commit 7fa67d1c5d259dabe34e35427c9f69746ac82260
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-10-30T09:48:46Z

    add UnsafeColumnVector to support array for table cache

commit 05ec886fb93cb0f05137f3b69bcd20b20455225b
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-10-30T17:43:40Z

    fix failures in CacheTableSuite, HiveCompatibilitySuite, and HiveQuerySuite

commit 761516a1e3234472a92493a9375e1db79998a1b0
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-10-30T17:44:05Z

    remove wrong assert to fix failures

commit 98f764fea0c8ee256395d285b7b69d62797d3e93
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-11-05T06:57:47Z

    avoid overriding ColumnVector.getArray()

commit 5477d5b3bc2362d8ab861af1601989dfb1fefa79
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-11-05T15:53:24Z

    fix test failures

commit 029af07d73cba3beafe0c591ef4c14e18bfe4dd1
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-11-10T16:06:20Z

    Remove ColumnVector.putUnsafeData()

commit 94c6a97017528a068214117ce061b1ce9dd053b3
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-11-15T18:32:30Z

    rebase with master

commit 11af85cf0d13f6cff2f3c246afc1e664de8e3e41
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-11-16T07:18:17Z

    address review comment

commit 63d9d576799d057646e991326c38b5fdb3a9f361
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-11-16T07:54:19Z

    fix compilation error

commit 9b6b890b0444f3a20e73691528b59ad21edb07b8
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-11-21T18:51:00Z

    fix failures of rebase

----

