GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/20172
[SPARK-22979][PYTHON][SQL] Avoid per-record type dispatch in Python data conversion (EvaluatePython.fromJava)
## What changes were proposed in this pull request?
It seems we can avoid type dispatch for each value when converting Java objects (from
Pyrolite) to Spark's internal data format, because we know the schema ahead of time.
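The idea can be sketched outside Spark as follows. This is a minimal illustration of the pattern, not Spark's actual `EvaluatePython` code; `SimpleType`, `IntType`, and `StrType` are hypothetical placeholders standing in for Spark's data types:

```scala
object DispatchSketch {
  sealed trait SimpleType
  case object IntType extends SimpleType
  case object StrType extends SimpleType

  // Before: the type match runs once per value, for every record.
  def fromJava(value: Any, dt: SimpleType): Any = dt match {
    case IntType => value.asInstanceOf[Number].intValue
    case StrType => value.toString
  }

  // After: match once when the converter is built from the schema;
  // the returned closure converts values with no further dispatch.
  def makeFromJava(dt: SimpleType): Any => Any = dt match {
    case IntType => (v: Any) => v.asInstanceOf[Number].intValue
    case StrType => (v: Any) => v.toString
  }

  def main(args: Array[String]): Unit = {
    val convert = makeFromJava(StrType)
    // Both paths produce the same value; only the dispatch cost differs.
    assert(fromJava(42, StrType) == convert(42))
    println(convert(42))
  }
}
```

For a struct type, `makeFromJava` would recursively build one converter per field, so the per-field dispatch is also paid only once per schema rather than once per row.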
I manually performed the benchmark as below:
```scala
test("EvaluatePython.fromJava / EvaluatePython.makeFromJava") {
  val numRows = 1000 * 1000
  val numFields = 30
  val random = new Random(System.nanoTime())
  val types = Array(
    BooleanType, ByteType, FloatType, DoubleType, IntegerType, LongType, ShortType,
    DecimalType.ShortDecimal, DecimalType.IntDecimal, DecimalType.ByteDecimal,
    DecimalType.FloatDecimal, DecimalType.LongDecimal, new DecimalType(5, 2),
    new DecimalType(12, 2), new DecimalType(30, 10), CalendarIntervalType)
  val schema = RandomDataGenerator.randomSchema(random, numFields, types)
  val rows = mutable.ArrayBuffer.empty[Array[Any]]
  var i = 0
  while (i < numRows) {
    val row = RandomDataGenerator.randomRow(random, schema)
    rows += row.toSeq.toArray
    i += 1
  }

  val benchmark = new Benchmark("EvaluatePython.fromJava / EvaluatePython.makeFromJava", numRows)

  benchmark.addCase("Before - EvaluatePython.fromJava", 3) { _ =>
    var i = 0
    while (i < numRows) {
      EvaluatePython.fromJava(rows(i), schema)
      i += 1
    }
  }

  benchmark.addCase("After - EvaluatePython.makeFromJava", 3) { _ =>
    val fromJava = EvaluatePython.makeFromJava(schema)
    var i = 0
    while (i < numRows) {
      fromJava(rows(i))
      i += 1
    }
  }

  benchmark.run()
}
```
```
Running benchmark: EvaluatePython.fromJava / EvaluatePython.makeFromJava
  Running case: Before - EvaluatePython.fromJava
  Stopped after 3 iterations, 4036 ms
  Running case: After - EvaluatePython.makeFromJava
  Stopped after 3 iterations, 1945 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz

EvaluatePython.fromJava / EvaluatePython.makeFromJava:  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
------------------------------------------------------------------------------------------------
Before - EvaluatePython.fromJava                             1265 / 1346        0.8       1264.8       1.0X
After - EvaluatePython.makeFromJava                           571 /  649        1.8        570.8       2.2X
```
If the structure is nested, I expect the advantage to be even larger than this,
since per-field dispatch would otherwise be repeated at every level for every row.
## How was this patch tested?
Existing tests should cover this. Also, I manually verified via `assert` that the
values produced before and after the change are identical while running the
benchmarks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark type-dispatch-python-eval
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20172.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20172
----
commit 83c9b58670ab01c8abc11ffca08938b0a8189aee
Author: hyukjinkwon <gurwls223@...>
Date: 2018-01-06T08:27:56Z
Avoid per-record type dispatch in Python data conversion
(EvaluatePython.fromJava)
----