mitkedb commented on code in PR #45216:
URL: https://github.com/apache/spark/pull/45216#discussion_r1499330134
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##########
@@ -586,9 +584,17 @@ object ContainsExpressionBuilder extends StringBinaryPredicateExpressionBuilderB
 }
 case class Contains(left: Expression, right: Expression) extends StringPredicate {
- override def compare(l: UTF8String, r: UTF8String): Boolean = l.contains(r)
+ override def compare(l: UTF8String, r: UTF8String): Boolean = {
+ val collationID = right.dataType.asInstanceOf[StringType].collationId
Review Comment:
@dbatomic @uros-db
Should we consider special-casing the default UTF8 binary collation, so that
the default code path does not suffer a performance regression?
Something like the following:
  if (right.dataType.isDefaultCollationFinalVariable)
    default / current implementation
  else
    collation implementation
It is important that the variable is final, since IIRC that lets the JIT
optimize the branch significantly better.
The codegen version should also remain identical to what it was before we
added collations. I am guessing we can check the collation ID at compile time
and generate exactly the same code as before for UTF8 binary.
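
To make the suggestion above concrete, here is a minimal, self-contained
sketch of the idea. It does not use Spark's classes: `ContainsSketch`,
`DefaultCollationId`, `collationAwareContains`, `genContainsCode` and
`HypotheticalCollationHelper` are all made-up names for illustration, plain
`String` stands in for `UTF8String`, and a case-insensitive match stands in
for a real collation-aware comparison.

```scala
import java.util.Locale

// Illustrative sketch only; none of these names are Spark's actual API.
object CollationFastPathSketch {
  // Assumption: the default UTF8 binary collation has ID 0.
  final val DefaultCollationId = 0

  // Hypothetical slow path. A case-insensitive match stands in for a real
  // collation-aware implementation.
  def collationAwareContains(l: String, r: String, collationId: Int): Boolean =
    l.toLowerCase(Locale.ROOT).contains(r.toLowerCase(Locale.ROOT))

  final class ContainsSketch(collationId: Int) {
    // Decided once at construction, not on every call to compare.
    private val isDefaultCollation: Boolean = collationId == DefaultCollationId

    def compare(l: String, r: String): Boolean =
      if (isDefaultCollation) l.contains(r) // same behaviour as before collations
      else collationAwareContains(l, r, collationId)
  }

  // Codegen side: the collation ID is known when code is generated, so the
  // default collation can emit exactly the same snippet as before.
  def genContainsCode(collationId: Int, l: String, r: String): String =
    if (collationId == DefaultCollationId) s"$l.contains($r)"
    else s"HypotheticalCollationHelper.contains($l, $r, $collationId)"
}
```

With this shape, the interpreted path pays one boolean check for the default
collation, and the generated code for the default collation contains no
collation logic at all.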
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]