[PR] feat: Add Spark-compatible `encode` function to datafusion-spark [datafusion]

via GitHub Thu, 02 Apr 2026 23:47:33 -0700


JeelRajodiya opened a new pull request, #21331:
URL: https://github.com/apache/datafusion/pull/21331


   **Rationale**
   
   The `datafusion-spark` crate is missing the `encode` function. Spark's [`encode(expr, charset)`](https://spark.apache.org/docs/latest/api/sql/index.html#encode) converts a string or binary value into binary using a specified character encoding. It is commonly used in Spark SQL workloads and is needed by engines built on DataFusion that target Spark compatibility.
   
   **What changes are included in this PR?**
   
   Adds `SparkEncode` to `datafusion-spark`'s string functions. It supports 
**US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, and UTF-16LE** charsets. 
Binary input is handled via lossy UTF-8 conversion (invalid bytes → U+FFFD), 
matching Spark/Databricks behavior.
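   
   For readers unfamiliar with the charsets listed above, the following is a minimal, std-only sketch of the kind of per-charset conversion described. It is not the PR's `SparkEncode` code: the function names `encode_sketch` and `encode_binary_sketch` are hypothetical, and two behaviors are assumptions modeled on Java's charset encoders rather than anything stated in this description: unmappable characters become `?` for US-ASCII and ISO-8859-1, and plain `UTF-16` emits a big-endian byte order mark.

```rust
/// Hypothetical helper (not from the PR): encode `input` using `charset`.
/// Charset names are matched case-insensitively.
fn encode_sketch(input: &str, charset: &str) -> Result<Vec<u8>, String> {
    match charset.to_ascii_uppercase().as_str() {
        "UTF-8" => Ok(input.as_bytes().to_vec()),
        // Assumption: unmappable characters are replaced with '?'.
        "US-ASCII" => Ok(input
            .chars()
            .map(|c| if c.is_ascii() { c as u8 } else { b'?' })
            .collect()),
        // ISO-8859-1 maps code points U+0000..=U+00FF directly to one byte.
        "ISO-8859-1" => Ok(input
            .chars()
            .map(|c| if (c as u32) <= 0xFF { c as u32 as u8 } else { b'?' })
            .collect()),
        "UTF-16BE" => Ok(input
            .encode_utf16()
            .flat_map(|unit| unit.to_be_bytes())
            .collect()),
        "UTF-16LE" => Ok(input
            .encode_utf16()
            .flat_map(|unit| unit.to_le_bytes())
            .collect()),
        // Assumption: plain UTF-16 is big-endian with a leading BOM (FE FF).
        "UTF-16" => {
            let mut out = vec![0xFE, 0xFF];
            out.extend(input.encode_utf16().flat_map(|unit| unit.to_be_bytes()));
            Ok(out)
        }
        other => Err(format!("unsupported charset: {other}")),
    }
}

/// Hypothetical helper (not from the PR): binary input is first decoded as
/// UTF-8 lossily, so invalid bytes become U+FFFD, and is then encoded.
fn encode_binary_sketch(input: &[u8], charset: &str) -> Result<Vec<u8>, String> {
    let text = String::from_utf8_lossy(input);
    encode_sketch(&text, charset)
}
```

   Under these assumptions, `encode_sketch("A", "utf-16le")` yields the bytes `41 00`, and `encode_binary_sketch(&[0x61, 0xFF], "UTF-8")` yields `61 EF BF BD`, because the invalid byte `0xFF` is replaced by U+FFFD before encoding.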
   
   **Are these changes tested?**
   
   Yes — 15 unit tests covering all charsets, case-insensitive charset 
matching, null handling, binary input with lossy UTF-8, Utf8View columns, 
unsupported charset errors, and return field nullability.
   
   **Are there any user-facing changes?**
   
   New `encode` scalar function available when using `datafusion-spark`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

