GideonPotok commented on code in PR #46526:
URL: https://github.com/apache/spark/pull/46526#discussion_r1598747396


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,10 +74,22 @@ case class Mode(
     buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-    if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+    if (buff.isEmpty) {
       return null
     }
+    val buffer = if (isCollatedString(child)) {
+      val modeMap = buff.toSeq.groupMapReduce {
+        case (key: String, _) =>
+          CollationFactory.getCollationKey(UTF8String.fromString(key), 
collationId)
+        case (key: UTF8String, _) =>
+          CollationFactory.getCollationKey(key, collationId)
+        case (key, _) => key
+      }(x => x)((x, y) => (x._1, x._2 + y._2)).values
+      modeMap
+    } else {
+      buff
+    }

Review Comment:
   @uros-db Addressing the initial question about why Mode isn't functioning 
with collated strings in Spark as expected, even though Aggregation typically 
works: It seems that while the ordering for PhysicalStringType is correctly 
established using `private[sql] val ordering = 
CollationFactory.fetchCollation(collationId).comparator.compare(_, _)`, this 
does not automatically resolve the issue with Mode. To illustrate, consider the 
example of UTF8_BINARY_LCASE where an input like `Map("a" -> 3L, "b" -> 2L, "B" 
-> 2L)` results in evaluating the maximum over the tuples `(2L, "B"), (2L, 
"b"), (3L, "a")` rather than the expected `(3L, "a"), (4L, "b")`. This 
indicates that the current approach doesn't aggregate values as required for 
Mode to operate correctly. Unit tests confirm that Mode otherwise won't work 
for such cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to