GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1592911161
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
buffer
}
- override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
- if (buffer.isEmpty) {
+ override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+ if (buff.isEmpty) {
return null
}
+ val buffer = if (child.dataType.isInstanceOf[StringType] &&
collationEnabled) {
+ val modeMap = buff.foldLeft(
+ new TreeMap[org.apache.spark.unsafe.types.UTF8String,
Long]()(Ordering.comparatorToOrdering(
+ CollationFactory.fetchCollation(collationId).comparator
Review Comment:
I will be running it soon in GHA and will then add the `...results.txt`
report to the code change.
Here is a sneak Peek.
Note that once I make your other suggested changes, the UTF8_BINARY case
will become the "Collation Not Enabled" Case, once I remove the
collationEnabled flag from the Mode class signature...
```
[info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4.1
[info] Apple M3 Max
[info] collation unit benchmarks - mode: Best Time(ms) Avg
Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info]
------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE 1027
1198 241 0.1 10272.2 1.0X
[info] UNICODE 3249
3262 19 0.0 32485.8 0.3X
[info] UTF8_BINARY 805
806 1 0.1 8050.6 1.3X
[info] UNICODE_CI 3623
3632 13 0.0 36233.6 0.3X
[info] Collation Not Enabled 28
30 1 3.5 284.6 36.1X
[info] Numerical Type 28
29 1 3.6 277.5 37.0X
```
As you can see, the performance is impacted!
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
buffer
}
- override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
- if (buffer.isEmpty) {
+ override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+ if (buff.isEmpty) {
return null
}
+ val buffer = if (child.dataType.isInstanceOf[StringType] &&
collationEnabled) {
+ val modeMap = buff.foldLeft(
+ new TreeMap[org.apache.spark.unsafe.types.UTF8String,
Long]()(Ordering.comparatorToOrdering(
+ CollationFactory.fetchCollation(collationId).comparator
Review Comment:
Here is my plan:
1. Get your thoughts on these performance results...
2. Analyze the computation complexity in theory. Being very standard java
Collection classes, the TreeMap vs OpenHashMap documentation will surely will
state what their complexity class/big-o is known to be. After that, I will
provide this analysis in the PR description.
3. I can analyze the computation complexity empirically: I can either add to
the benchmark or do an ad-hoc benchmark to determine how it grows as input
grows the time grows .
Do you have any thoughts on what else I might want to look into?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]