n0r0shi opened a new pull request, #3571: URL: https://github.com/apache/datafusion-comet/pull/3571
## Which issue does this PR close?

Closes #419.

## Rationale for this change

Spark's `base64()` expression is currently not supported by Comet, causing fallback to Spark. This is a commonly used function that can be mapped directly to DataFusion's built-in `encode(input, 'base64')` function with no Rust changes.

## What changes are included in this PR?

Two code paths handle different Spark versions:

- **Spark 3.4**: `Base64` is a direct expression node. Added a `CometBase64` handler in `strings.scala` and registered it in the `stringExpressions` map.
- **Spark 3.5+**: `Base64` is `RuntimeReplaceable`, so Spark's optimizer rewrites it into `StaticInvoke(Base64.encode, [input, chunkBase64])` before Comet sees the plan. Added a `CometBase64Encode` handler in `statics.scala` to handle this.

Both paths produce the same DataFusion call: `encode(input, 'base64')`.

**Chunked base64** (`spark.sql.chunkBase64String.enabled=true`, which inserts newlines every 76 chars per RFC 2045) is not supported by DataFusion's `encode` function, so it falls back to Spark. I can look into the DataFusion side of this later.

## Are these changes tested?

- Normal base64 encoding: `checkSparkAnswerAndOperator` verifies correct results and native Comet execution.
- NULL handling: verified via `checkSparkAnswerAndOperator`.
- Chunked base64 fallback: `checkSparkAnswerAndFallbackReason` verifies correct results via Spark fallback and checks the expected fallback reason message.
- The Spark 3.4 direct expression handler (`CometBase64`) is exercised when CI runs the `spark-3.4` profile. On Spark 3.5+ it is not reached because Spark replaces `Base64` with `StaticInvoke` during optimization.

## Are there any user-facing changes?

Yes. `base64()` is now executed natively by Comet instead of falling back to Spark, improving performance for queries using this function.

```sql
SELECT base64(binary_column) FROM table
```
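To illustrate why the chunked mode cannot simply reuse a plain base64 encoder, here is a minimal Python sketch (not code from this PR) that contrasts the two outputs. The helper names `plain_base64` and `chunked_base64` are hypothetical, and the CRLF separator follows the RFC 2045 convention; whether Spark emits exactly this separator is an assumption here, not something the PR states.

```python
import base64


def plain_base64(data: bytes) -> str:
    """Single-line base64, i.e. the non-chunked output."""
    return base64.b64encode(data).decode("ascii")


def chunked_base64(data: bytes, width: int = 76, sep: str = "\r\n") -> str:
    """Base64 with a separator inserted between every `width` output characters.

    Assumption: RFC 2045 style, 76-character lines joined by CRLF,
    with no trailing separator.
    """
    encoded = plain_base64(data)
    lines = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    return sep.join(lines)


# 102 input bytes encode to 136 base64 characters (no padding needed).
payload = bytes(range(102))
plain = plain_base64(payload)
chunked = chunked_base64(payload)

# The chunked form carries the same characters, split into lines of 76 and 60.
line_lengths = [len(line) for line in chunked.split("\r\n")]
```

For inputs whose encoding fits within 76 characters the two forms are identical, so the difference only shows up on longer values. That is consistent with the PR falling back to Spark whenever chunking is enabled rather than trying to decide per row.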
