andygrove opened a new issue, #3183:
URL: https://github.com/apache/datafusion-comet/issues/3183
## What is the problem the feature request solves?
> **Note:** This issue was generated with AI assistance. The specification
details have been extracted from Spark documentation and may need verification.
Comet does not currently support the Spark `encode` function, causing
queries using this function to fall back to Spark's JVM execution instead of
running natively on DataFusion.
The `Encode` expression converts a string to binary data using a specified
character encoding. It is a runtime-replaceable expression that delegates to a
static method for the actual encoding, and it honors legacy charset and
error-handling configurations.
Supporting this expression would allow more Spark workloads to benefit from
Comet's native acceleration.
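For context, the fallback is easy to observe from a `spark-shell` with Comet enabled; this is a minimal sketch, and the exact plan node names vary by Spark/Comet version:
```scala
// With Comet enabled, a query using encode() should show the projection
// running as a plain Spark JVM operator rather than a Comet native one.
val df = spark.range(1).selectExpr("encode(cast(id AS string), 'UTF-8') AS b")
df.explain() // look for a Spark Project instead of a Comet operator
```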
## Describe the potential solution
### Spark Specification
**Syntax:**
```sql
ENCODE(string_expr, charset_expr)
```
**Arguments:**
| Argument | Type | Description |
|----------|------|-------------|
| str | String | The input string expression to be encoded |
| charset | String | The character set/encoding name to use for conversion |
| legacyCharsets | Boolean | Internal flag for legacy Java charset handling behavior |
| legacyErrorAction | Boolean | Internal flag for legacy coding error action behavior |

Note that the two legacy flags are internal constructor parameters driven by
SQLConf, not arguments of the SQL syntax above.
**Return Type:** BinaryType - Returns binary data representing the encoded
string.
**Supported Data Types:**
- Input string: StringTypeWithCollation (supports trim collation)
- Input charset: StringTypeWithCollation (supports trim collation)
- Both string inputs support collation-aware string types
**Edge Cases:**
- Null input string or charset results in null output
- Invalid charset names may throw runtime exceptions
- Legacy flags affect error handling behavior for malformed input
- Empty string input produces empty binary output
- Behavior varies based on SQLConf legacy settings for charset and error
handling
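The null and empty-string cases are quick to check from a `spark-shell`; the expected results noted in the comments follow the Spark documentation:
```scala
// Null string (or charset) propagates to a null result.
spark.sql("SELECT encode(NULL, 'UTF-8') AS r").show()   // r = NULL
// Empty string input yields an empty (zero-byte) binary value.
spark.sql("SELECT encode('', 'UTF-8') AS r").show()     // r = []
// Invalid charset names raise a runtime error (subject to legacy settings):
// spark.sql("SELECT encode('abc', 'BOGUS') AS r").show()
```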
**Examples:**
```sql
-- Encode string using UTF-8
SELECT ENCODE('hello world', 'UTF-8');
-- Encode with different charset
SELECT ENCODE('café', 'ISO-8859-1');
```
```scala
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(expr("ENCODE(name, 'UTF-8')").as("encoded_name"))
// Using column expressions (encode takes the charset as a plain String)
df.select(encode(col("text_column"), "UTF-16"))
```
### Implementation Approach
See the [Comet guide on adding new
expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html)
for detailed instructions.
1. **Scala Serde**: Add an expression handler in
`spark/src/main/scala/org/apache/comet/serde/` (a rough sketch follows this list)
2. **Register**: Add to appropriate map in `QueryPlanSerde.scala`
3. **Protobuf**: Add message type in `native/proto/src/proto/expr.proto` if
needed
4. **Rust**: Implement in `native/spark-expr/src/` (check if DataFusion has
built-in support first)
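As a starting point for steps 1 and 2, a handler might look roughly like the sketch below. This is only an illustration: the trait name `CometExpressionSerde`, the helpers `exprToProto` / `scalarFunctionExprToProto`, and the `Encode` field names are assumptions based on how existing string functions are wired up, so mirror a real handler from the serde package rather than copying this verbatim. Note also that DataFusion's built-in `encode` does hex/base64 encoding rather than charset conversion, so the Rust side likely needs a Spark-specific kernel.

```scala
// Hypothetical sketch -- package paths, trait, and helper names are
// assumptions; check existing string-function handlers before using.
import org.apache.spark.sql.catalyst.expressions.{Attribute, Encode, Expression}
import org.apache.comet.serde.ExprOuterClass
import org.apache.comet.serde.QueryPlanSerde.{exprToProto, scalarFunctionExprToProto}

object CometEncode extends CometExpressionSerde {
  override def convert(
      expr: Expression,
      inputs: Seq[Attribute],
      binding: Boolean): Option[ExprOuterClass.Expr] = {
    val e = expr.asInstanceOf[Encode]
    // A cautious first version could return None (fall back to Spark) when
    // the legacy charset/error-action flags request non-default behavior.
    val strProto = exprToProto(e.str, inputs, binding)
    val charsetProto = exprToProto(e.charset, inputs, binding)
    // Map onto a native scalar function (implemented in native/spark-expr);
    // None here means Spark fallback, e.g. if a child is unsupported.
    scalarFunctionExprToProto("encode", strProto, charsetProto)
  }
}
```

Registration (step 2) would then map `classOf[Encode]` to this handler in the
expression map in `QueryPlanSerde.scala`, per the guide.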
## Additional context
**Difficulty:** Medium
**Spark Expression Class:**
`org.apache.spark.sql.catalyst.expressions.Encode`
**Related:**
- Decode - inverse operation to convert binary back to string
- Cast expressions for type conversions
- String manipulation functions in string_funcs group
---
*This issue was auto-generated from Spark reference documentation.*