GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/22395

    [SPARK-16323][SQL] Add IntegralDivide expression

    ## What changes were proposed in this pull request?
    
    This PR takes over #14036 and introduces a new expression, `IntegralDivide`,
    to avoid the several unneeded casts added previously.
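    
    To illustrate the idea outside of Spark, here is a minimal plain-Scala sketch
    comparing a cast-based division with a direct integral division. The object
    and method names (`IntegralDivideSketch`, `viaDouble`, `direct`) are
    hypothetical and are not part of this PR; the sketch only mirrors the shape
    of the two approaches and assumes non-zero divisors.
    
    ```scala
    object IntegralDivideSketch {
      // Cast-based path: widen both operands to Double, divide, truncate to Long.
      def viaDouble(a: Long, b: Long): Long = (a.toDouble / b.toDouble).toLong
    
      // Direct path: integral division on the original type, no intermediate casts.
      def direct(a: Long, b: Long): Long = a / b
    
      def main(args: Array[String]): Unit = {
        // Small operands: both paths truncate toward zero and agree.
        println(viaDouble(7L, 2L) == direct(7L, 2L))
    
        // Large operands: a Double cannot represent every Long exactly,
        // so the cast-based path may return a different value.
        val big = Long.MaxValue - 1
        println(viaDouble(big, 1L) == direct(big, 1L))
      }
    }
    ```
    
    Besides skipping the extra cast work, the direct path keeps full precision
    for large `Long` operands in this plain-Scala setting.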
    
    To demonstrate the performance gain, the following benchmark was run:
    
    ```
      test("Benchmark IntegralDivide") {
        val r = new scala.util.Random(91)
        val nData = 1000000
        val testDataInt = (1 to nData).map(_ => (r.nextInt(), r.nextInt()))
        val testDataLong = (1 to nData).map(_ => (r.nextLong(), r.nextLong()))
        val testDataShort = (1 to nData).map(_ => (r.nextInt().toShort, r.nextInt().toShort))
    
        // old code: cast both operands to double, divide, then cast the result to long
        val oldExprsInt = testDataInt.map(x =>
          Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
        val oldExprsLong = testDataLong.map(x =>
          Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
        val oldExprsShort = testDataShort.map(x =>
          Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
    
        // new code: divide directly on the integral type
        val newExprsInt = testDataInt.map(x => IntegralDivide(x._1, x._2))
        val newExprsLong = testDataLong.map(x => IntegralDivide(x._1, x._2))
        val newExprsShort = testDataShort.map(x => IntegralDivide(x._1, x._2))
    
        Seq(("Long", "old", oldExprsLong),
          ("Long", "new", newExprsLong),
          ("Int", "old", oldExprsInt),
          ("Int", "new", newExprsInt),
          ("Short", "old", oldExprsShort),
          ("Short", "new", newExprsShort)).foreach { case (dt, t, ds) =>
          val start = System.nanoTime()
          ds.foreach(e => e.eval(EmptyRow))
          val endNoCodegen = System.nanoTime()
          val elapsedMs = (endNoCodegen - start) / 1000000
          println(s"Running $nData op with $t code on $dt (no-codegen): $elapsedMs ms")
        }
      }
    ```
    
    The results on my laptop are:
    
    ```
    Running 1000000 op with old code on Long (no-codegen): 600 ms
    Running 1000000 op with new code on Long (no-codegen): 112 ms
    Running 1000000 op with old code on Int (no-codegen): 560 ms
    Running 1000000 op with new code on Int (no-codegen): 135 ms
    Running 1000000 op with old code on Short (no-codegen): 317 ms
    Running 1000000 op with new code on Short (no-codegen): 153 ms
    ```
    
    This shows a 2-5x improvement. The benchmark does not cover code generation,
    because performance is hard to measure there: for such simple operations,
    most of the time is spent in the code generation/compilation process itself.
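    
    One possible way to keep one-time costs (such as compilation) out of a
    measurement is to run the workload a few times unmeasured before timing it.
    The following is only a sketch under that assumption; `TimingSketch` and
    `timeMs` are hypothetical names and were not used to produce the numbers
    above.
    
    ```scala
    object TimingSketch {
      // Runs `body` `warmup` times without measuring, then returns the
      // average wall-clock time in milliseconds over `runs` measured runs.
      def timeMs(warmup: Int, runs: Int)(body: => Unit): Double = {
        require(runs > 0, "runs must be positive")
        var i = 0
        while (i < warmup) { body; i += 1 }
        val start = System.nanoTime()
        i = 0
        while (i < runs) { body; i += 1 }
        (System.nanoTime() - start) / 1e6 / runs
      }
    }
    ```
    
    In the benchmark above, a call could look like
    `TimingSketch.timeMs(warmup = 3, runs = 5) { ds.foreach(e => e.eval(EmptyRow)) }`.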
    
    ## How was this patch tested?
    
    Added unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-16323

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22395.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22395
    
----
commit 649b45875e81e47ba3282150b669f766fbe806ba
Author: Marco Gaido <marcogaido91@...>
Date:   2018-09-11T15:50:56Z

    [SPARK-16323][SQ] Add IntegerDivide expression

----

