Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression [spark]

via GitHub Tue, 07 May 2024 15:12:51 -0700


GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1593161711



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
     buffer
   }
 
-  override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
-    if (buffer.isEmpty) {
+  override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+    if (buff.isEmpty) {
       return null
     }
+    val buffer = if (child.dataType.isInstanceOf[StringType] && 
collationEnabled) {
+      val modeMap = buff.foldLeft(
+        new TreeMap[org.apache.spark.unsafe.types.UTF8String, 
Long]()(Ordering.comparatorToOrdering(
+          CollationFactory.fetchCollation(collationId).comparator

Review Comment:
   @uros-db `org.apache.spark.util.collection.OpenHashMap` is actually a spark 
data structure, so I could try to modify it in a second candidate PR and then 
see about having `Mode` rely on a `OpenHashMap` configured to be collation 
aware..
   
   The way I could  do it would be to modify   
`org.apache.spark.util.collection.OpenHashSet.Hasher.hash()` , specifically the 
override of hash within a class we would create `Hasher[String with 
Collation...]`, to branch to an alternative hashing method that is collation 
sensitive. 
   
   (I would not be modifying 
`org.apache.spark.util.collection.OpenHashSet#hashcode` , which distributes the 
hashes more evenly and is called on values returned by `Hasher.hash`) 
   
   This should work because `OpenHashMap` relies internally on `OpenHashSet` 
which uses `org.apache.spark.util.collection.OpenHashSet.Hasher`
   
    



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression [spark]

Reply via email to