andygrove opened a new issue, #4516: URL: https://github.com/apache/datafusion-comet/issues/4516
When a `ScalaUDF` is dispatched into the native plan via the JVM Scala UDF codegen dispatcher (enabled by default in #4514), Comet's `CometProject` and `CometHashAggregate` do not implement Spark's cross-sibling common subexpression elimination over `ScalaUDF`. An expression such as `sum(udf(b) + udf(b) + udf(b))` therefore invokes the UDF body once per reference instead of once. This is observable in Spark's `SQLQuerySuite` "Common subexpression elimination" test: the call count for the aggregate case is 3 under Comet versus 1 in vanilla Spark. The query result is unchanged; only the number of UDF invocations differs. Follow-on from #4514. We should extend the cross-sibling CSE that `CometProject` performs to the aggregate operator's input projection (and any other operator that builds an input projection) so that repeated `ScalaUDF` references are evaluated once. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
