GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1594753431
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
buffer
}
- override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
- if (buffer.isEmpty) {
+ override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+ if (buff.isEmpty) {
return null
}
+    val buffer = if (child.dataType.isInstanceOf[StringType] && collationEnabled) {
+      val modeMap = buff.foldLeft(
+        new TreeMap[org.apache.spark.unsafe.types.UTF8String, Long]()(
+          Ordering.comparatorToOrdering(
+            CollationFactory.fetchCollation(collationId).comparator
Review Comment:
@uros-db I implemented a proof of concept
[here](https://github.com/apache/spark/pull/46488).
I believe the benchmarking I did there (see the PR description) shows
roughly a 28x slowdown rather than an 89x slowdown. (I know we had a 130x
slowdown before, but I modified the benchmarks slightly; the updated results
follow.)
Here are the results for the originally proposed implementation (this PR),
which uses the red-black tree / TreeMap:
```
[info] collation unit benchmarks - mode - 10000 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 10000 elements                     13             15           5         7.6         130.9       1.0X
[info] UNICODE - mode - 10000 elements                               39             41           2         2.6         392.0       0.3X
[info] UTF8_BINARY - mode - 10000 elements                            1              1           0       129.0           7.7      16.9X
[info] UNICODE_CI - mode - 10000 elements                            39             41           2         2.6         391.4       0.3X
[info] Numerical Type - mode - 10000 elements                         0              1           0       204.5           4.9      26.8X
[info]
[info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4.1
[info] Apple M3 Max
[info] collation unit benchmarks - mode - 5000 elements:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 5000 elements                       4              5           0        11.1          89.7       1.0X
[info] UNICODE - mode - 5000 elements                                13             14           1         3.8         266.2       0.3X
[info] UTF8_BINARY - mode - 5000 elements                             0              0           0       140.5           7.1      12.6X
[info] UNICODE_CI - mode - 5000 elements                             13             14           1         3.8         265.1       0.3X
[info] Numerical Type - mode - 5000 elements                          0              0           0       251.4           4.0      22.5X
```
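For context, the TreeMap approach groups collation-equivalent keys by ordering them with the collation's comparator. Here is a minimal, hypothetical Java sketch of that idea, with `java.text.Collator` standing in for Spark's `CollationFactory` comparator (class and method names are made up for illustration):

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TreeMapMode {
    // Merge the hash-aggregated (value, count) buffer into a red-black tree
    // ordered by the collation comparator; collation-equivalent keys collide
    // in the tree, so their counts are summed. Then pick the key with the
    // largest merged count.
    public static String mode(Map<String, Long> buffer, Comparator<String> cmp) {
        TreeMap<String, Long> merged = new TreeMap<>(cmp);
        buffer.forEach((k, v) -> merged.merge(k, v, Long::sum));
        return merged.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        Collator ci = Collator.getInstance(Locale.ROOT);
        ci.setStrength(Collator.SECONDARY); // roughly case-insensitive
        Map<String, Long> buffer = new LinkedHashMap<>();
        buffer.put("a", 2L);
        buffer.put("A", 2L); // equivalent to "a" under the collator
        buffer.put("b", 3L);
        System.out.println(mode(buffer, ci::compare)); // prints "a" (merged count 4 beats 3)
    }
}
```

Every insertion into the tree costs an O(log n) walk where each step invokes the collation comparator, which is where the slowdown comes from.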
And here are the results for the OpenHashMap implementation, seen
[here](https://github.com/apache/spark/pull/46488):
```
[info] collation unit benchmarks - mode - 10000 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 10000 elements                      8              9           1        12.6          79.5       1.0X
[info] UNICODE - mode - 10000 elements                               13             14           1         7.4         135.0       0.6X
[info] UTF8_BINARY - mode - 10000 elements                            1              1           0       120.0           8.3       9.5X
[info] UNICODE_CI - mode - 10000 elements                            14             14           1         7.3         137.9       0.6X
[info] Numerical Type - mode - 10000 elements                         0              1           0       218.0           4.6      17.3X
[info]
[info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4.1
[info] Apple M3 Max
[info] collation unit benchmarks - mode - 5000 elements:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
[info] --------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 5000 elements                       3              3           0        18.6          53.6       1.0X
[info] UNICODE - mode - 5000 elements                                 5              6           0         9.5         104.9       0.5X
[info] UTF8_BINARY - mode - 5000 elements                             0              0           0       148.4           6.7       8.0X
[info] UNICODE_CI - mode - 5000 elements                              5              5           0        11.1          90.3       0.6X
[info] Numerical Type - mode - 5000 elements                          0              0           0       267.6           3.7      14.3X
```
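The prototype's alternative keeps a hash map but makes collation-equivalent strings hash and compare as equal. A hypothetical Java sketch of that idea, keying the map on the bytes of a `java.text.CollationKey` (again just a stand-in for a collation-aware hash of `UTF8String`; all names here are illustrative):

```java
import java.text.Collator;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class HashMapMode {
    // Count values in a plain HashMap, but key each entry on its collation
    // key bytes so that collation-equivalent strings land in the same bucket.
    // A representative original spelling is kept for the final answer.
    public static String mode(List<String> values, Collator collator) {
        Map<String, Long> counts = new HashMap<>();
        Map<String, String> repr = new HashMap<>();
        for (String v : values) {
            String key = Base64.getEncoder()
                    .encodeToString(collator.getCollationKey(v).toByteArray());
            counts.merge(key, 1L, Long::sum);
            repr.putIfAbsent(key, v); // remember the first-seen spelling
        }
        String bestKey = null;
        long bestCount = -1;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                bestKey = e.getKey();
            }
        }
        return bestKey == null ? null : repr.get(bestKey);
    }

    public static void main(String[] args) {
        Collator ci = Collator.getInstance(Locale.ROOT);
        ci.setStrength(Collator.SECONDARY); // roughly case-insensitive
        System.out.println(mode(List.of("a", "A", "a", "b", "b"), ci)); // prints "a"
    }
}
```

Here each value pays a one-time collation-key computation instead of a comparator call per tree level, which is consistent with the smaller slowdowns in the second table.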
What I notice, though, is that the slowdown occurs most acutely for the
Unicode collations. Is there an underlying reason they are so much less
performant than the UTF8_BINARY collation? Is it because they cannot compare
byte by byte? I understand why that would affect the TreeMap implementation,
but if it also affects the OpenHashMap prototype with its different hashing
function, then maybe the cost comes from the conversion to UTF8String in
both implementations, which is not related to the choice of data structure.
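On the byte-by-byte point: a binary comparator can stop at the first differing byte, whereas a collator has to compute per-character collation weights, and strings with different bytes can still compare equal. A small Java illustration of that effect (using `java.text.Collator` rather than Spark's collation code):

```java
import java.text.Collator;
import java.util.Locale;

public class BinaryVsCollated {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        String composed = "\u00E9";    // "é" as a single code point
        String decomposed = "e\u0301"; // "e" plus a combining acute accent
        // Byte-for-byte (and char-for-char) the strings differ...
        System.out.println(composed.equals(decomposed));     // prints "false"
        // ...but under canonical decomposition the collator treats them as
        // equal, so collated comparison cannot short-circuit on raw bytes.
        System.out.println(c.compare(composed, decomposed)); // prints "0"
    }
}
```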
Please let me know what you think.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]