GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1593161711
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
buffer
}
- override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
- if (buffer.isEmpty) {
+ override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+ if (buff.isEmpty) {
return null
}
+ val buffer = if (child.dataType.isInstanceOf[StringType] &&
collationEnabled) {
+ val modeMap = buff.foldLeft(
+ new TreeMap[org.apache.spark.unsafe.types.UTF8String,
Long]()(Ordering.comparatorToOrdering(
+ CollationFactory.fetchCollation(collationId).comparator
Review Comment:
@uros-db `org.apache.spark.util.collection.OpenHashMap` is actually a spark
data structure, so I could try to modify it in a second candidate PR and then
see about having `Mode` rely on a `OpenHashMap` configured to be collation
aware..
The way I could do it would be to modify
`org.apache.spark.util.collection.OpenHashSet.Hasher.hash()` , specifically the
override of hash within a class we would create `Hasher[String with
Collation...]`, to branch to an alternative hashing method that is collation
sensitive.
(I would not be modifying
`org.apache.spark.util.collection.OpenHashSet#hashcode` , which distributes the
hashes more evenly and is called on values returned by `Hasher.hash`)
This should work because `OpenHashMap` relies internally on `OpenHashSet`
which uses `org.apache.spark.util.collection.OpenHashSet.Hasher`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]