GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1626659308
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -74,16 +90,25 @@ case class Mode(
     if (buffer.isEmpty) {
       return null
     }
-
+    val collationAwareBuffer = child.dataType match {
+      case c: StringType if
+        !CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality =>
+        val collationId = c.collationId
+        val modeMap = buffer.toSeq.groupMapReduce {
Review Comment:
@uros-db can you weigh in on whether you think further exploration of
@dbatomic's approach is warranted? What @dbatomic is suggesting is slightly
different from the approach I have prototyped in the past (though it is
basically what you suggested very early on). His suggestion might look
something like the change to
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
that I have just now whipped up here:
https://github.com/GideonPotok/spark/pull/2/files#diff-d758fde336b16d62c0feb00db9350822d60bd16805f8a816f2ee82b343d851aa
It is possible his approach will work well, but as we found previously when
we tried to use the hashmap plus an auxiliary data structure
(https://github.com/GideonPotok/spark/pull/1/files), it may bring more
complexity and more gotchas than we want to take on for this feature.
Is it worth it? I am capable of implementing it that way, but it will
take a bit of work.
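
For context, here is a minimal standalone sketch of the groupMapReduce idea
shown in the snippet above: collapse the Mode buffer's (value -> count)
entries so that values which are equal under a non-binary collation share one
merged count. It is only an illustration, not the PR's code: `toLowerCase`
stands in for CollationFactory.getCollationKey, and a plain Scala Map stands
in for the OpenHashMap buffer.

```scala
object CollationAwareModeSketch {
  def main(args: Array[String]): Unit = {
    // Simulated aggregation buffer: value -> occurrence count.
    val buffer = Map("Spark" -> 2L, "SPARK" -> 3L, "flink" -> 1L)

    // Group entries by a collation key (toLowerCase is a stand-in for
    // CollationFactory.getCollationKey) and merge counts, keeping one
    // representative value per group.
    val collationAwareBuffer = buffer.toSeq.groupMapReduce {
      case (key, _) => key.toLowerCase // collation key
    }(x => x)((x, y) => (x._1, x._2 + y._2)).values

    // The mode is the representative with the highest merged count.
    val mode = collationAwareBuffer.maxBy(_._2)
    println(s"collation-aware buffer: $collationAwareBuffer")
    println(s"mode: $mode") // (Spark, 5) under this case-insensitive stand-in
  }
}
```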