wankunde commented on PR #41685:
URL: https://github.com/apache/spark/pull/41685#issuecomment-1600993090
> This test looks good. I've verified that it tests the code path.
>
> Another thing is, the claim of the proposed change is to improve distinct
> queries performance. But I don't see any reported number of performance. If you
> have run benchmark or you have production workloads getting improvement from
> it, could you post the numbers?

Here is a local benchmark:
```scala
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.execution.benchmark.SqlBasedBenchmark
import org.apache.spark.sql.internal.SQLConf

object AggregateBenchmark extends SqlBasedBenchmark {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("aggregate benchmark with fast hashMap") {
      // Sweep the fast hash map capacity: 2^16, 2^18 and 2^20 rows.
      Seq(16, 18, 20).foreach { CAPACITY_BIT =>
        val N = 1 << CAPACITY_BIT
        val benchmark = new Benchmark(s"HashMap size : $N", N, output = output)
        // `id` is unique per row, so distinct() produces N groups.
        val inputDF = spark
          .range(N)
          .selectExpr(
            "id",
            "(id & 1023) as k1",
            "cast(id & 1023 as string) as k2",
            "cast(id & 1023 as int) as k3",
            "cast(id & 1023 as double) as k4",
            "cast(id & 1023 as float) as k5",
            "id > 1023 as k6")
        inputDF.cache()
        // One case with the two-level aggregate map disabled, one with it enabled.
        Seq(false, true).foreach { enable =>
          benchmark.addCase(s"Aggregate with two level aggregate $enable", numIters = 2) { _ =>
            withSQLConf(
              SQLConf.ENABLE_TWOLEVEL_AGG_MAP.key -> enable.toString,
              SQLConf.FAST_HASH_AGGREGATE_MAX_ROWS_CAPACITY_BIT.key -> CAPACITY_BIT.toString) {
              inputDF.distinct().noop()
            }
          }
        }
        benchmark.run()
      }
    }
  }
}
```
Benchmark result:
```
Running benchmark: HashMap size : 65536
  Running case: Aggregate with two level aggregate false
  Stopped after 2 iterations, 240 ms
  Running case: Aggregate with two level aggregate true
  Stopped after 2 iterations, 119 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
HashMap size : 65536:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Aggregate with two level aggregate false            117            120           4          0.6        1791.9       1.0X
Aggregate with two level aggregate true              58             60           3          1.1         880.0       2.0X

Running benchmark: HashMap size : 262144
  Running case: Aggregate with two level aggregate false
  Stopped after 2 iterations, 339 ms
  Running case: Aggregate with two level aggregate true
  Stopped after 2 iterations, 270 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
HashMap size : 262144:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Aggregate with two level aggregate false            169            170           1          1.6         644.7       1.0X
Aggregate with two level aggregate true             134            135           2          2.0         510.3       1.3X

Running benchmark: HashMap size : 1048576
  Running case: Aggregate with two level aggregate false
  Stopped after 2 iterations, 1353 ms
  Running case: Aggregate with two level aggregate true
  Stopped after 2 iterations, 1771 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.16
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
HashMap size : 1048576:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Aggregate with two level aggregate false            672            677           6          1.6         641.2       1.0X
Aggregate with two level aggregate true             749            886         193          1.4         714.2       0.9X
```
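
For reading the tables, here is a minimal sketch that recomputes the Relative column from the Best Time(ms) values, assuming Relative is the first case's best time divided by each case's best time. `RelativeSpeedup` and `relative` are hypothetical names used only for illustration and are not part of Spark's `Benchmark` API; the inputs are the rounded millisecond values printed above.

```scala
// Sketch only: recompute the Relative column from the Best Time(ms) values
// reported above. `relative` is a hypothetical helper, not a Spark API.
object RelativeSpeedup {
  // (hash map size, best ms with two-level map disabled, best ms with it enabled)
  val bestTimes: Seq[(Int, Double, Double)] = Seq(
    (1 << 16, 117.0, 58.0),
    (1 << 18, 169.0, 134.0),
    (1 << 20, 672.0, 749.0))

  // Assumption: Relative = baseline best time / case best time.
  def relative(baselineMs: Double, caseMs: Double): Double = baselineMs / caseMs

  def main(args: Array[String]): Unit = {
    bestTimes.foreach { case (n, disabledMs, enabledMs) =>
      println(f"N = $n%7d: ${relative(disabledMs, enabledMs)}%.1fX")
    }
  }
}
```

Under that assumption the ratios round to the reported 2.0X, 1.3X and 0.9X, so the two-level map helps at 65536 and 262144 rows and is slightly slower at 1048576 rows in this run. Note that the last case also has a large standard deviation (193 ms) over only two iterations.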
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]