GitHub user kiszk reopened a pull request:
https://github.com/apache/spark/pull/19601
[SPARK-22383][SQL] Generate code to directly get value of primitive type
array from ColumnVector for table cache
## What changes were proposed in this pull request?
This PR generates Java code that reads values of a primitive-type array
directly from a `ColumnVector`, without going through an iterator, for the
table cache (e.g. `dataframe.cache`). This improves runtime performance by
eliminating the data copy from column-oriented storage to `InternalRow` in a
`SpecificColumnarIterator` for primitive types. This is a follow-up to #18747.
The idea of this implementation is to add `ColumnVector.UnsafeArray`, which
keeps an `UnsafeArrayData` for an array, in addition to `ColumnVector.Array`,
which keeps a `ColumnVector` backed by a Java primitive array for an array.
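To illustrate the difference between the two access paths described above, here is a minimal, self-contained sketch. It is not Spark code: `IntArrayColumn`, `copyRow`, `getInt`, `countViaRowCopy`, and `countDirect` are hypothetical names invented for this example, and the flat data/offsets layout is an assumption rather than the actual `ColumnVector` layout. It only shows why indexing into column-oriented storage directly avoids the per-row copy that a row-based iterator performs.

```java
import java.util.Arrays;

public class DirectArrayAccessSketch {

    /** Toy column of int arrays (hypothetical; not Spark's ColumnVector). */
    static final class IntArrayColumn {
        private final int[] data;    // elements of all rows, stored back to back
        private final int[] offsets; // row r occupies data[offsets[r] .. offsets[r + 1])

        IntArrayColumn(int[][] rows) {
            offsets = new int[rows.length + 1];
            for (int r = 0; r < rows.length; r++) {
                offsets[r + 1] = offsets[r] + rows[r].length;
            }
            data = new int[offsets[rows.length]];
            for (int r = 0; r < rows.length; r++) {
                System.arraycopy(rows[r], 0, data, offsets[r], rows[r].length);
            }
        }

        int numRows() {
            return offsets.length - 1;
        }

        /** Row-style path: materializes a copy of the whole array for one row. */
        int[] copyRow(int row) {
            return Arrays.copyOfRange(data, offsets[row], offsets[row + 1]);
        }

        /** Direct path: reads a single element in place, with no copy. */
        int getInt(int row, int index) {
            return data[offsets[row] + index];
        }
    }

    /** Counts rows whose element at `index` is a multiple of 10, copying each row first. */
    static long countViaRowCopy(IntArrayColumn col, int index) {
        long count = 0;
        for (int r = 0; r < col.numRows(); r++) {
            int[] row = col.copyRow(r); // O(array length) copy per row
            if (row[index] % 10 == 0) {
                count++;
            }
        }
        return count;
    }

    /** Same predicate, but reads the element directly from the column storage. */
    static long countDirect(IntArrayColumn col, int index) {
        long count = 0;
        for (int r = 0; r < col.numRows(); r++) {
            if (col.getInt(r, index) % 10 == 0) { // O(1) read per row
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] rows = { {0, 10, 3}, {1, 7, 5}, {2, 20, 9} };
        IntArrayColumn col = new IntArrayColumn(rows);
        System.out.println(countViaRowCopy(col, 1)); // prints 2
        System.out.println(countDirect(col, 1));     // prints 2
    }
}
```

Both methods return the same count; the direct version simply skips the intermediate copy, which is the kind of work this PR aims to remove from the generated code.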
Benchmark result: **21.4x** faster (best time 1368 ms vs. 64 ms)
```
OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
Filter for int primitive array with cache:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
InternalRow codegen                               1368 / 1887         23.0          43.5       1.0X

Filter for int primitive array with cache:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
ColumnVector codegen                                  64 / 90        488.1           2.0       1.0X
```
Benchmark program
```
intArrayBenchmark(sqlContext, 1024 * 1024 * 20)

def intArrayBenchmark(sqlContext: SQLContext, values: Int, iters: Int = 20): Unit = {
  import sqlContext.implicits._
  val benchmarkPT = new Benchmark("Filter for int primitive array with cache", values, iters)
  // ROWS is not defined in this snippet; it is assumed to be a constant defined elsewhere.
  val df = sqlContext.sparkContext.parallelize(0 to ROWS, 1)
    .map(i => Array.range(i, values)).toDF("a").cache
  df.count  // force creation of the cached data
  val str = "ColumnVector"
  var c: Long = 0
  benchmarkPT.addCase(s"$str codegen") { iter =>
    // Filter on the middle element of each array and accumulate the match count.
    c += df.filter(s"a[${values/2}] % 10 = 0").count
  }
  benchmarkPT.run()
  df.unpersist(true)
  System.gc()
}
```
## How was this patch tested?
Added test cases to `ColumnVectorSuite`, `DataFrameTungstenSuite`, and
`WholeStageCodegenSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kiszk/spark SPARK-22383
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19601.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19601
----
commit 12dd996b134cbd6aeb83d70dc14f50fc2516e6ea
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-10-29T03:28:06Z
initial commit
commit 7fa67d1c5d259dabe34e35427c9f69746ac82260
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-10-30T09:48:46Z
add UnsafeColumnVector to support array for table cache
commit 05ec886fb93cb0f05137f3b69bcd20b20455225b
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-10-30T17:43:40Z
fix failures in CacheTableSuite, HiveCompatibilitySuite, and HiveQuerySuite
commit 761516a1e3234472a92493a9375e1db79998a1b0
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-10-30T17:44:05Z
remove wrong assert to fix failures
commit 98f764fea0c8ee256395d285b7b69d62797d3e93
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-11-05T06:57:47Z
avoid to override ColumnVector.getArray()
commit 5477d5b3bc2362d8ab861af1601989dfb1fefa79
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-11-05T15:53:24Z
fix test failures
commit 029af07d73cba3beafe0c591ef4c14e18bfe4dd1
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-11-10T16:06:20Z
Remove ColumnVector.putUnsafeData()
commit 94c6a97017528a068214117ce061b1ce9dd053b3
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-11-15T18:32:30Z
rebase with master
commit 11af85cf0d13f6cff2f3c246afc1e664de8e3e41
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-11-16T07:18:17Z
address review comment
commit 63d9d576799d057646e991326c38b5fdb3a9f361
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-11-16T07:54:19Z
fix compilation error
commit 9b6b890b0444f3a20e73691528b59ad21edb07b8
Author: Kazuaki Ishizaki <[email protected]>
Date: 2017-11-21T18:51:00Z
fix failures of rebase
----