Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression [spark]

via GitHub Wed, 08 May 2024 15:07:25 -0700


GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1594753431



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
     buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-    if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+    if (buff.isEmpty) {
       return null
     }
+    val buffer = if (child.dataType.isInstanceOf[StringType] && 
collationEnabled) {
+      val modeMap = buff.foldLeft(
+        new TreeMap[org.apache.spark.unsafe.types.UTF8String, 
Long]()(Ordering.comparatorToOrdering(
+          CollationFactory.fetchCollation(collationId).comparator

Review Comment:
   @uros-db I basically implemented a proof of concept 
[here](https://github.com/apache/spark/pull/46488). 
   
   I believe the benchmarking I did there (see the PR description) maybe shows 
28x slowdown rather then a 89x slowdown.  (I know we had a 130x slowdown 
before, but I modified the benchmarks slightly and now it is the following
   
   Here is for the originally proposed implementation (This PR) which was the 
Red-black tree / TreeMap 
   
   ```
   [info] collation unit benchmarks - mode - 10000 elements:  Best Time(ms)   
Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] 
---------------------------------------------------------------------------------------------------------------------------------
   [info] UTF8_BINARY_LCASE - mode - 10000 elements                     13      
       15           5          7.6         130.9       1.0X
   [info] UNICODE - mode - 10000 elements                               39      
       41           2          2.6         392.0       0.3X
   [info] UTF8_BINARY - mode - 10000 elements                            1      
        1           0        129.0           7.7      16.9X
   [info] UNICODE_CI - mode - 10000 elements                            39      
       41           2          2.6         391.4       0.3X
   [info] Numerical Type - mode - 10000 elements                         0      
        1           0        204.5           4.9      26.8X
   
   [info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4.1
   [info] Apple M3 Max
   [info] collation unit benchmarks - mode - 5000 elements:  Best Time(ms)   
Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] 
--------------------------------------------------------------------------------------------------------------------------------
   [info] UTF8_BINARY_LCASE - mode - 5000 elements                      4       
       5           0         11.1          89.7       1.0X
   [info] UNICODE - mode - 5000 elements                               13       
      14           1          3.8         266.2       0.3X
   [info] UTF8_BINARY - mode - 5000 elements                            0       
       0           0        140.5           7.1      12.6X
   [info] UNICODE_CI - mode - 5000 elements                            13       
      14           1          3.8         265.1       0.3X
   [info] Numerical Type - mode - 5000 elements                         0       
       0           0        251.4           4.0      22.5X
   ```
   
   And here are results for the OpenHashMap implementation seen 
[here](https://github.com/apache/spark/pull/46488)
   
   ```
   collation unit benchmarks - mode - 10000 elements:  Best Time(ms)   Avg 
Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] 
---------------------------------------------------------------------------------------------------------------------------------
   [info] UTF8_BINARY_LCASE - mode - 10000 elements                      8      
        9           1         12.6          79.5       1.0X
   [info] UNICODE - mode - 10000 elements                               13      
       14           1          7.4         135.0       0.6X
   [info] UTF8_BINARY - mode - 10000 elements                            1      
        1           0        120.0           8.3       9.5X
   [info] UNICODE_CI - mode - 10000 elements                            14      
       14           1          7.3         137.9       0.6X
   [info] Numerical Type - mode - 10000 elements                         0      
        1           0        218.0           4.6      17.3X
   
   
   [info] OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Mac OS X 14.4.1
   [info] Apple M3 Max
   [info] collation unit benchmarks - mode - 5000 elements:  Best Time(ms)   
Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] 
--------------------------------------------------------------------------------------------------------------------------------
   [info] UTF8_BINARY_LCASE - mode - 5000 elements                      3       
       3           0         18.6          53.6       1.0X
   [info] UNICODE - mode - 5000 elements                                5       
       6           0          9.5         104.9       0.5X
   [info] UTF8_BINARY - mode - 5000 elements                            0       
       0           0        148.4           6.7       8.0X
   [info] UNICODE_CI - mode - 5000 elements                             5       
       5           0         11.1          90.3       0.6X
   [info] Numerical Type - mode - 5000 elements                         0       
       0           0        267.6           3.7      14.3X
   ```
   
   The thing I notice though is the slowdown is really happening most accutely 
for the unicode collations. Is there some underlying reason they are so much 
less performance then either UTF8_BINARY Collation? Is it because it can't 
compare byte by byte? I get why that would affect the TreeMap implementation, 
but if it is affecting the prototype for use of OpenHashMap with different 
hashing function, maybe it is from the conversion to UTF8String by both 
implementations. Which is not related to choice of data structures.
   
   Please let me know what you think. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression [spark]

Reply via email to