Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression [spark]

via GitHub Tue, 07 May 2024 11:36:54 -0700


GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1592911161



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
     buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-    if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+    if (buff.isEmpty) {
       return null
     }
+    val buffer = if (child.dataType.isInstanceOf[StringType] && 
collationEnabled) {
+      val modeMap = buff.foldLeft(
+        new TreeMap[org.apache.spark.unsafe.types.UTF8String, 
Long]()(Ordering.comparatorToOrdering(
+          CollationFactory.fetchCollation(collationId).comparator

Review Comment:
   I will be running it soon in GHA and will then add the `...results.txt` 
report to the code change.
   
   Here is a sneak Peek. 
   
   Note that once I make your other suggested changes, the UTF8_BINARY case 
will become the "Collation Not Enabled" Case, once I remove the 
collationEnabled flag from the Mode class signature...
   
   ``` 
   [info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4.1
   [info] Apple M3 Max
   [info] collation unit benchmarks - mode:         Best Time(ms)   Avg 
Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] 
------------------------------------------------------------------------------------------------------------------------
   [info] UTF8_BINARY_LCASE                                  1027           
1198         241          0.1       10272.2       1.0X
   [info] UNICODE                                            3249           
3262          19          0.0       32485.8       0.3X
   [info] UTF8_BINARY                                         805            
806           1          0.1        8050.6       1.3X
   [info] UNICODE_CI                                         3623           
3632          13          0.0       36233.6       0.3X
   [info] Collation Not Enabled                                28             
30           1          3.5         284.6      36.1X
   [info] Numerical Type                                       28             
29           1          3.6         277.5      37.0X
   ```
   
   As you can see, the performance is impacted! 
   



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
     buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-    if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+    if (buff.isEmpty) {
       return null
     }
+    val buffer = if (child.dataType.isInstanceOf[StringType] && 
collationEnabled) {
+      val modeMap = buff.foldLeft(
+        new TreeMap[org.apache.spark.unsafe.types.UTF8String, 
Long]()(Ordering.comparatorToOrdering(
+          CollationFactory.fetchCollation(collationId).comparator

Review Comment:
   
   Here is my plan: 
   1. Get your thoughts on these performance results... 
   2.  Analyze the computation complexity in theory. Being very standard java 
Collection classes, the TreeMap vs OpenHashMap documentation will surely will 
state what their complexity class/big-o is known to be. After that, I will 
provide this analysis in the PR description.
   3. I can analyze the computation complexity empirically: I can either add to 
the benchmark or do an ad-hoc benchmark to determine how it grows as input 
grows the time grows . 
   
   Do you have any thoughts on what else I might want to look into?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression [spark]

Reply via email to