GitHub user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/22364
@maropu I have run the following benchmark:
```
test("AttributeSet -- benchmark") {
val attrSetA = AttributeSet((1 to 100).map { i =>
AttributeReference(s"c$i", IntegerType)() })
val attrSetB = attrSetA.take(80).toSeq
val attrSetC = (1 to 100).map { i => AttributeReference(s"c2_$i",
IntegerType)() }
val attrSetD = (attrSetA.take(50) ++ attrSetC.take(50)).toSeq
val attrSetE = attrSetC.take(50) ++ attrSetA.take(50)
val n_iter = 1000000
val t0 = System.nanoTime()
(1 to n_iter) foreach { _ =>
val r1 = attrSetA -- attrSetB
val r2 = attrSetA -- attrSetC
val r3 = attrSetA -- attrSetD
val r4 = attrSetA -- attrSetE
}
val t1 = System.nanoTime()
(1 to n_iter) foreach { _ =>
val r1 = attrSetA subsetOf AttributeSet(attrSetB)
val r2 = attrSetA subsetOf AttributeSet(attrSetC)
val r3 = attrSetA subsetOf AttributeSet(attrSetD)
val r4 = attrSetA subsetOf AttributeSet(attrSetE)
}
val t2 = System.nanoTime()
val totalTime1 = t1 - t0
val totalTime2 = t2 - t1
println(s"Average time for --: ${totalTime1 / n_iter} us")
println(s"Average time for subsetOf: ${totalTime2 / n_iter} us")
}
```
And the output is:
```
Average time for --: 25065 ns
Average time for subsetOf: 108638 ns
```
So for the case you mentioned, using `subsetOf` would actually introduce a
performance regression (roughly 25 µs vs. 109 µs per iteration of the four
operations above). I have also run all the tests in
`StarJoinCostBasedReorderSuite` 1000 times, and the regression was confirmed:
```
Running StarJoinCostBasedReorderSuite's tests 1000 times takes w/o change: 68877186927 us
Running StarJoinCostBasedReorderSuite's tests 1000 times takes with change: 70689955856 us
```
The point is that at that call site we have a `Seq[Attribute]`, not an
`AttributeSet`, as the parameter: `--` can consume the `Seq` directly, while
`subsetOf` first needs an `AttributeSet` to be built from it, and that
conversion dominates the cost.
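For illustration, here is a minimal sketch of the two call shapes; the data
and identifiers below are hypothetical, not the exact code touched by this PR:
```
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, AttributeSet}
import org.apache.spark.sql.types.IntegerType

// `produced` plays the role of the Seq[Attribute] parameter at the call site.
val attrs = (1 to 3).map(i => AttributeReference(s"c$i", IntegerType)())
val produced: Seq[Attribute] = attrs
val needed = AttributeSet(attrs.take(2))

// With `--`, the Seq is consumed directly; no new set is built:
val ok1 = (needed -- produced).isEmpty

// With `subsetOf`, an AttributeSet must first be built from the Seq,
// hashing every attribute on each call:
val ok2 = needed subsetOf AttributeSet(produced)
```
Both checks return the same answer; only the conversion cost differs.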
Hope this is clear, let me know otherwise. Thanks.