GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1595553952
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
buffer
}
- override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
- if (buffer.isEmpty) {
+ override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+ if (buff.isEmpty) {
return null
}
+    val buffer = if (child.dataType.isInstanceOf[StringType] && collationEnabled) {
+      val modeMap = buff.foldLeft(
+        new TreeMap[org.apache.spark.unsafe.types.UTF8String, Long]()(Ordering.comparatorToOrdering(
+          CollationFactory.fetchCollation(collationId).comparator
Review Comment:
@uros-db
I also added a third approach to the experimental PR:
https://github.com/apache/spark/pull/46488
What kinds of cardinalities are expected? E.g., even if the data frame has 1
trillion rows, wouldn't it be a highly rare use case to have 1 trillion unique
values? That might be worth considering when choosing the range of
cardinalities to benchmark.
Also, what percentage of the values in the benchmarked data frames should be
duplicates within the same collation? Right now we are benchmarking just one
very specific case, and, at least in the experimental branch, it could be
useful to vary that ratio while we evaluate the different algorithms.
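To illustrate the TreeMap-based approach the diff above takes, here is a minimal, self-contained sketch. It stands in a plain `String` key and a hypothetical case-insensitive ordering for `UTF8String` and `CollationFactory.fetchCollation(collationId).comparator`, which are not available outside Spark; `modeUnderCollation` and its inputs are illustrative names, not Spark APIs. The idea is the same: fold raw (value, count) pairs into a `TreeMap` ordered by the collation comparator, so keys that are equal under the collation merge their counts before the mode is picked.

```scala
import scala.collection.immutable.TreeMap

object ModeSketch {
  // Hypothetical collation-aware ordering: case-insensitive comparison,
  // standing in for CollationFactory.fetchCollation(collationId).comparator.
  val collationOrdering: Ordering[String] = Ordering.by(_.toLowerCase)

  // Fold per-value counts into a TreeMap keyed by the collation ordering,
  // so values equal under the collation collapse into one entry whose
  // count is the sum, then take the entry with the highest count.
  def modeUnderCollation(counts: Map[String, Long]): Option[String] = {
    val merged = counts.foldLeft(TreeMap.empty[String, Long](collationOrdering)) {
      case (acc, (value, count)) =>
        acc.updated(value, acc.getOrElse(value, 0L) + count)
    }
    if (merged.isEmpty) None else Some(merged.maxBy(_._2)._1)
  }
}
```

For example, with counts `Map("a" -> 2L, "A" -> 3L, "b" -> 4L)`, the keys "a" and "A" merge to a combined count of 5, so the mode (under this case-insensitive collation) is "a"/"A" rather than "b". This is also why the duplicate-percentage question above matters for benchmarks: the cost of the fold depends on how many raw values collapse into each collation-equal group.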
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
For additional commands, e-mail: [email protected]