GideonPotok commented on code in PR #46597:
URL: https://github.com/apache/spark/pull/46597#discussion_r1627976630
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Mode.scala:
##########
@@ -74,16 +90,25 @@ case class Mode(
     if (buffer.isEmpty) {
       return null
     }
-
+    val collationAwareBuffer = child.dataType match {
+      case c: StringType if
+        !CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality =>
+        val collationId = c.collationId
+        val modeMap = buffer.toSeq.groupMapReduce {
Review Comment:
@dbatomic @uros-db here is a mockup of this proposal:
https://github.com/GideonPotok/spark/pull/2. To reiterate, it is merely a
mockup/proof of concept.
The relevant test passes, which suggests this approach could be viable. The next
step is to decide whether to move forward and how to make the key data type's
collation information available to the OpenHashMap. What are your thoughts on
the feasibility of moving forward?
My main concern is risk: integrating collation into the specialized types and
their hash functions might increase maintenance demands, introduce serialization
challenges, and lead to subtle bugs. Given how central this data structure is,
any change should come with a clear plan for validation, backward compatibility,
and thorough performance testing. It may be wise to prefer a less invasive
modification, such as the one proposed in this PR and sketched below.
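For concreteness, here is a minimal, self-contained Scala sketch of that less
invasive idea: re-grouping the already-accumulated counts by a collation key
before picking the mode, in the spirit of the `groupMapReduce` snippet quoted
above. `collationKey` and the lowercasing are stand-ins of mine; the real code
would derive the key from `CollationFactory` using the child expression's
`collationId` rather than lowercasing.

```scala
object CollationAwareModeSketch {
  // Stand-in for a real collation key: values that are equal under the
  // collation must map to the same key. Lowercasing only mimics a
  // case-insensitive collation; it is not the actual collation logic.
  private def collationKey(s: String): String = s.toLowerCase

  // `buffer` maps raw values to their accumulated counts, like Mode's buffer.
  def collationAwareMode(buffer: Map[String, Long]): String = {
    val merged = buffer.toSeq
      .groupMapReduce(kv => collationKey(kv._1))(_._2)(_ + _) // merge counts of collation-equal keys
    // Most frequent group; a real implementation also has to decide which
    // original value to return as the mode, not just the collation key.
    merged.maxBy(_._2)._1
  }

  def main(args: Array[String]): Unit = {
    val buffer = Map("Spark" -> 2L, "SPARK" -> 2L, "flink" -> 3L)
    println(collationAwareMode(buffer)) // "spark": the two case variants merge into a group of 4
  }
}
```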
Despite these concerns, this approach is working, and it touches a particularly
interesting part of the codebase that I am eager to dig into. If you think it is
a promising route, I am ready to complete the implementation and run further
benchmarks. However, I would appreciate some design guidance on the points
below.
To effectively implement this, I see two possible directions:
1. Is there a benefit to `TypedAggregateWithHashMapAsBuffer` using `AnyRef` keys
(as in `OpenHashMap[AnyRef, ...]`)? This was introduced in
https://github.com/apache/spark/pull/37216/files without a clear explanation of
why `AnyRef` was preferred over generics. Should
`TypedAggregateWithHashMapAsBuffer` remain unchanged, or should it evolve to
rely on `OpenHashMap[childExpression.dataType.getClass, ...]` for more specific
typing? @beliefer, although it has been some time since you worked on this,
could you advise on whether this component should be modified?
2. Assuming `TypedAggregateWithHashMapAsBuffer` remains unchanged, I am looking
for a cleaner way to inject the custom hashing logic (and a custom
`keyExistsAtPos` method) from `Mode` into the OpenHashMap, depending on
`childExpr.dataType`; see the sketch right after this list. I would greatly
value ideas on how best to integrate this. At the moment, the proof of concept
assumes that any object passed into `OpenHashSet` that is not a `Long`, `Int`,
`Double`, or `Float` is a `UTF8String` with `UTF8_BINARY_LCASE` collation.
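To illustrate what I mean by injecting hashing and equality in direction 2, here
is a hypothetical, self-contained Scala sketch. None of these names
(`KeySemantics`, `SemKey`, `CaseInsensitiveSemantics`) exist in Spark, and the
real `OpenHashMap`/`OpenHashSet` do not expose such a hook today; a real version
would pick the strategy from `childExpr.dataType` and back it with
`CollationFactory` rather than `equalsIgnoreCase`.

```scala
import scala.collection.mutable

// Hypothetical sketch, not Spark's OpenHashMap API: a pluggable key-semantics
// strategy supplies hashing and equality, so the map itself stays type-agnostic.
trait KeySemantics[K] {
  def hash(k: K): Int
  def equiv(a: K, b: K): Boolean
}

// Stand-in for a collation-aware strategy; a real one would consult
// CollationFactory.fetchCollation(collationId) for UTF8String keys.
object CaseInsensitiveSemantics extends KeySemantics[String] {
  def hash(k: String): Int = k.toLowerCase.hashCode
  def equiv(a: String, b: String): Boolean = a.equalsIgnoreCase(b)
}

// Wrapper that routes hashCode/equals through the injected semantics; the
// backing map never needs to know which data type or collation it is counting.
final class SemKey[K](val key: K, sem: KeySemantics[K]) {
  override def hashCode(): Int = sem.hash(key)
  override def equals(other: Any): Boolean = other match {
    case that: SemKey[K @unchecked] => sem.equiv(key, that.key)
    case _ => false
  }
}

object InjectedSemanticsDemo {
  def main(args: Array[String]): Unit = {
    val sem = CaseInsensitiveSemantics
    val counts = mutable.HashMap.empty[SemKey[String], Long]
    Seq("Spark", "SPARK", "flink").foreach { v =>
      val k = new SemKey(v, sem)
      counts.update(k, counts.getOrElse(k, 0L) + 1L) // "Spark" and "SPARK" share one slot
    }
    println(counts.map { case (k, c) => k.key -> c }) // e.g. HashMap(Spark -> 2, flink -> 1)
  }
}
```

The point is only that the map stays agnostic to collation: all collation
knowledge lives in the injected strategy, which is where I would like design
input.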
Lastly, while I am eager to complete the implementation, I want to be sure this
is something you would definitely want to pursue, barring any significant
performance regressions revealed by benchmarking. The proof of concept is
working, but a full implementation should ideally be something you are confident
is the right direction.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.