GideonPotok commented on code in PR #46404:
URL: https://github.com/apache/spark/pull/46404#discussion_r1595553952
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -70,20 +78,46 @@ case class Mode(
buffer
}
- override def eval(buffer: OpenHashMap[AnyRef, Long]): Any = {
- if (buffer.isEmpty) {
+ override def eval(buff: OpenHashMap[AnyRef, Long]): Any = {
+ if (buff.isEmpty) {
return null
}
+    val buffer = if (child.dataType.isInstanceOf[StringType] && collationEnabled) {
+      val modeMap = buff.foldLeft(
+        new TreeMap[org.apache.spark.unsafe.types.UTF8String, Long]()(Ordering.comparatorToOrdering(
+          CollationFactory.fetchCollation(collationId).comparator
Review Comment:
@uros-db
I also added a third approach to the experimental PR:
https://github.com/apache/spark/pull/46488
What kinds of cardinalities are expected? E.g., even if the data frame has 1
trillion rows, wouldn't it be a highly rare use case to have 1 trillion unique
values? That might be worth considering when choosing the range of
cardinalities to benchmark.
Also, what percentage of the values in the benchmarked data frames should be
duplicates within the same collation? Right now we are benchmarking just one
very specific case, and, at least in the experimental branch, it could be
useful to vary that ratio while we evaluate the different algorithms.
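To illustrate the TreeMap-based approach the diff above takes, here is a minimal, self-contained sketch. It stands in a plain `String` key and a hypothetical case-insensitive ordering for `UTF8String` and `CollationFactory.fetchCollation(collationId).comparator`, which are not available outside Spark; `modeUnderCollation` and its inputs are illustrative names, not Spark APIs. The idea is the same: fold raw (value, count) pairs into a `TreeMap` ordered by the collation comparator, so keys that are equal under the collation merge their counts before the mode is picked.

```scala
import scala.collection.immutable.TreeMap

object ModeSketch {
  // Hypothetical collation-aware ordering: case-insensitive comparison,
  // standing in for CollationFactory.fetchCollation(collationId).comparator.
  val collationOrdering: Ordering[String] = Ordering.by(_.toLowerCase)

  // Fold per-value counts into a TreeMap keyed by the collation ordering,
  // so values equal under the collation collapse into one entry whose
  // count is the sum, then take the entry with the highest count.
  def modeUnderCollation(counts: Map[String, Long]): Option[String] = {
    val merged = counts.foldLeft(TreeMap.empty[String, Long](collationOrdering)) {
      case (acc, (value, count)) =>
        acc.updated(value, acc.getOrElse(value, 0L) + count)
    }
    if (merged.isEmpty) None else Some(merged.maxBy(_._2)._1)
  }
}
```

For example, with counts `Map("a" -> 2L, "A" -> 3L, "b" -> 4L)`, the keys "a" and "A" merge to a combined count of 5, so the mode (under this case-insensitive collation) is "a"/"A" rather than "b". This is also why the duplicate-percentage question above matters for benchmarks: the cost of the fold depends on how many raw values collapse into each collation-equal group.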
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
For additional commands, e-mail: [email protected]