GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1595553952


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
     buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-    if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+    if (buff.isEmpty) {
       return null
     }
+    val buffer = if (child.dataType.isInstanceOf[StringType] && collationEnabled) {
+      val modeMap = buff.foldLeft(
+        new TreeMap[org.apache.spark.unsafe.types.UTF8String, Long]()(Ordering.comparatorToOrdering(
+          CollationFactory.fetchCollation(collationId).comparator

Review Comment:
   @uros-db 
   
   I also added a third approach in the experimental PR 
https://github.com/apache/spark/pull/46488.
   
   
   What kinds of cardinalities are expected? For example, even if a data frame 
has one trillion rows, wouldn't it be a highly rare use case for all one 
trillion values to be unique? That seems worth considering when choosing the 
range of cardinalities to benchmark.
   
   Also, what percentage of values in the benchmarked data frames should be 
duplicates under the same collation? Right now we are benchmarking only one 
very specific case, and at least in the experimental branch it could be useful 
to vary that ratio while we evaluate the different algorithms.
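   To make the benchmark discussion concrete, here is a minimal, self-contained sketch of the TreeMap-based approach quoted above. It uses `String.CASE_INSENSITIVE_ORDER` as a stand-in for `CollationFactory.fetchCollation(collationId).comparator`, and plain `String` keys instead of `UTF8String`; `ModeSketch` and `collationMode` are hypothetical names, not from the PR:

   ```scala
   import java.util.Comparator
   import scala.collection.immutable.TreeMap

   object ModeSketch {
     // Hypothetical stand-in for the per-collation comparator: here,
     // case-insensitive, so "Seattle" and "SEATTLE" collapse to one key.
     val comparator: Comparator[String] = String.CASE_INSENSITIVE_ORDER

     // Fold per-value counts into a TreeMap ordered by the collation
     // comparator, so keys that are equal under the collation merge,
     // then pick the key with the highest merged count.
     def collationMode(counts: Map[String, Long]): Option[String] = {
       val merged = counts.foldLeft(
         TreeMap.empty[String, Long](Ordering.comparatorToOrdering(comparator))) {
         case (acc, (key, count)) =>
           acc.updated(key, acc.getOrElse(key, 0L) + count)
       }
       if (merged.isEmpty) None else Some(merged.maxBy(_._2)._1)
     }
   }
   ```

   The duplicate ratio within a collation directly controls how much merging the TreeMap does, which is why varying it in the benchmark matters.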
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

