GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/20172
[SPARK-22979][PYTHON][SQL] Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
## What changes were proposed in this pull request?
It seems we can avoid type dispatch for each value when converting Java objects (from
Pyrolite) to Spark's internal data format, because we know the schema ahead of time.
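The idea can be sketched outside Spark as follows. This is a minimal illustration of the pattern, not Spark's actual `EvaluatePython` code; `SimpleType`, `IntType`, and `StrType` are hypothetical placeholders standing in for Spark's data types:

```scala
object DispatchSketch {
  sealed trait SimpleType
  case object IntType extends SimpleType
  case object StrType extends SimpleType

  // Before: the type match runs once per value, for every record.
  def fromJava(value: Any, dt: SimpleType): Any = dt match {
    case IntType => value.asInstanceOf[Number].intValue
    case StrType => value.toString
  }

  // After: match once when the converter is built from the schema;
  // the returned closure converts values with no further dispatch.
  def makeFromJava(dt: SimpleType): Any => Any = dt match {
    case IntType => (v: Any) => v.asInstanceOf[Number].intValue
    case StrType => (v: Any) => v.toString
  }

  def main(args: Array[String]): Unit = {
    val convert = makeFromJava(StrType)
    // Both paths produce the same value; only the dispatch cost differs.
    assert(fromJava(42, StrType) == convert(42))
    println(convert(42))
  }
}
```

For a struct type, `makeFromJava` would recursively build one converter per field, so the per-field dispatch is also paid only once per schema rather than once per row.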
I manually performed the benchmark as below:
```scala
test("EvaluatePython.fromJava / EvaluatePython.makeFromJava") {
  val numRows = 1000 * 1000
  val numFields = 30
  val random = new Random(System.nanoTime())
  val types = Array(
    BooleanType, ByteType, FloatType, DoubleType, IntegerType, LongType, ShortType,
    DecimalType.ShortDecimal, DecimalType.IntDecimal, DecimalType.ByteDecimal,
    DecimalType.FloatDecimal, DecimalType.LongDecimal, new DecimalType(5, 2),
    new DecimalType(12, 2), new DecimalType(30, 10), CalendarIntervalType)
  val schema = RandomDataGenerator.randomSchema(random, numFields, types)
  val rows = mutable.ArrayBuffer.empty[Array[Any]]
  var i = 0
  while (i < numRows) {
    val row = RandomDataGenerator.randomRow(random, schema)
    rows += row.toSeq.toArray
    i += 1
  }

  val benchmark = new Benchmark("EvaluatePython.fromJava / EvaluatePython.makeFromJava", numRows)

  benchmark.addCase("Before - EvaluatePython.fromJava", 3) { _ =>
    var i = 0
    while (i < numRows) {
      EvaluatePython.fromJava(rows(i), schema)
      i += 1
    }
  }

  benchmark.addCase("After - EvaluatePython.makeFromJava", 3) { _ =>
    val fromJava = EvaluatePython.makeFromJava(schema)
    var i = 0
    while (i < numRows) {
      fromJava(rows(i))
      i += 1
    }
  }

  benchmark.run()
}
```
```
Running benchmark: EvaluatePython.fromJava / EvaluatePython.makeFromJava
  Running case: Before - EvaluatePython.fromJava
  Stopped after 3 iterations, 4036 ms
  Running case: After - EvaluatePython.makeFromJava
  Stopped after 3 iterations, 1945 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz

EvaluatePython.fromJava / EvaluatePython.makeFromJava:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Before - EvaluatePython.fromJava                             1265 / 1346        0.8       1264.8       1.0X
After - EvaluatePython.makeFromJava                           571 /  649        1.8        570.8       2.2X
```
If the structure is nested, I expect the advantage to be even larger than this,
since per-field dispatch would otherwise be repeated at every level for every row.
## How was this patch tested?
Existing tests should cover this. Also, I manually verified via `assert` that the
values produced before and after the change are identical while running the
benchmarks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark type-dispatch-python-eval
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20172.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20172
----
commit 83c9b58670ab01c8abc11ffca08938b0a8189aee
Author: hyukjinkwon <gurwls223@...>
Date: 2018-01-06T08:27:56Z
Avoid per-record type dispatch in Python data conversion
(EvaluatePython.fromJava)
----