GitHub user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/22364
@maropu I have run the following benchmark:
```
test("AttributeSet -- benchmark") {
val attrSetA = AttributeSet((1 to 100).map { i =>
AttributeReference(s"c$i", IntegerType)() })
val attrSetB = attrSetA.take(80).toSeq
val attrSetC = (1 to 100).map { i => AttributeReference(s"c2_$i",
IntegerType)() }
val attrSetD = (attrSetA.take(50) ++ attrSetC.take(50)).toSeq
val attrSetE = attrSetC.take(50) ++ attrSetA.take(50)
val n_iter = 1000000
val t0 = System.nanoTime()
(1 to n_iter) foreach { _ =>
val r1 = attrSetA -- attrSetB
val r2 = attrSetA -- attrSetC
val r3 = attrSetA -- attrSetD
val r4 = attrSetA -- attrSetE
}
val t1 = System.nanoTime()
(1 to n_iter) foreach { _ =>
val r1 = attrSetA subsetOf AttributeSet(attrSetB)
val r2 = attrSetA subsetOf AttributeSet(attrSetC)
val r3 = attrSetA subsetOf AttributeSet(attrSetD)
val r4 = attrSetA subsetOf AttributeSet(attrSetE)
}
val t2 = System.nanoTime()
val totalTime1 = t1 - t0
val totalTime2 = t2 - t1
println(s"Average time for --: ${totalTime1 / n_iter} us")
println(s"Average time for subsetOf: ${totalTime2 / n_iter} us")
}
```
And the output is:
```
Average time for --: 25065 ns
Average time for subsetOf: 108638 ns
```
So for the case you mentioned, using `subsetOf` would actually introduce a
performance regression (roughly 25 µs vs. 109 µs per iteration of the four
operations above). I have also run all the tests in
`StarJoinCostBasedReorderSuite` 1000 times, and the regression was confirmed:
```
Running StarJoinCostBasedReorderSuite's tests 1000 times takes w/o change: 68877186927 us
Running StarJoinCostBasedReorderSuite's tests 1000 times takes with change: 70689955856 us
```
The point is that at that call site we have a `Seq[Attribute]`, not an
`AttributeSet`, as the parameter: `--` can consume the `Seq` directly, while
`subsetOf` first needs an `AttributeSet` to be built from it, and that
conversion dominates the cost.
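For illustration, here is a minimal sketch of the two call shapes; the data
and identifiers below are hypothetical, not the exact code touched by this PR:
```
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, AttributeSet}
import org.apache.spark.sql.types.IntegerType

// `produced` plays the role of the Seq[Attribute] parameter at the call site.
val attrs = (1 to 3).map(i => AttributeReference(s"c$i", IntegerType)())
val produced: Seq[Attribute] = attrs
val needed = AttributeSet(attrs.take(2))

// With `--`, the Seq is consumed directly; no new set is built:
val ok1 = (needed -- produced).isEmpty

// With `subsetOf`, an AttributeSet must first be built from the Seq,
// hashing every attribute on each call:
val ok2 = needed subsetOf AttributeSet(produced)
```
Both checks return the same answer; only the conversion cost differs.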
Hope this is clear, let me know otherwise. Thanks.