andygrove opened a new issue, #3165:
URL: https://github.com/apache/datafusion-comet/issues/3165
## What is the problem the feature request solves?
> **Note:** This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.
Comet does not currently support the Spark `map_filter` function, causing
queries using this function to fall back to Spark's JVM execution instead of
running natively on DataFusion.
MapFilter is a higher-order function that filters map entries based on a
lambda predicate function. It applies the provided lambda function to each
key-value pair in the input map and returns a new map containing only the
entries where the predicate evaluates to true.
Supporting this expression would allow more Spark workloads to benefit from
Comet's native acceleration.
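As a concrete illustration of the fallback, a minimal spark-shell sketch might look like the following (the sample data, column name, and plugin configuration hint are assumptions for illustration, not part of this issue):
```scala
// Hypothetical repro in spark-shell, assuming the Comet plugin is enabled for
// the session (e.g. --conf spark.plugins=org.apache.spark.CometPlugin).
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(Map(1 -> 0, 2 -> 2, 3 -> -1)).toDF("m")
  .select(map_filter(col("m"), (k, v) => k > v).as("filtered"))

// With map_filter unsupported, the projection evaluating it stays on Spark's
// JVM path rather than running in a native Comet operator.
df.explain("extended")
df.show(truncate = false)
```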
## Describe the potential solution
### Spark Specification
**Syntax:**
```sql
map_filter(map_expression, lambda_function)
```
```scala
// DataFrame API usage
map_filter(col("map_column"), (k, v) => k > v)
```
**Arguments:**
| Argument | Type | Description |
|----------|------|-------------|
| map_expression | MapType | The input map to be filtered |
| lambda_function | Lambda | A function that takes two parameters (key, value) and returns a boolean |
**Return Type:** Returns a MapType with the same key and value types as the
input map, containing only the entries that satisfy the predicate condition.
**Supported Data Types:**
- Input map can have keys and values of any supported Spark SQL data type
- The lambda function must return a boolean expression
- Key and value types are preserved in the output map
**Edge Cases:**
- If input map is null, returns null
- If lambda function returns null for a key-value pair, that pair is
excluded (treated as false)
- Empty input map returns empty map of the same type
- Lambda function exceptions will cause the entire expression to fail
- Maintains original key and value data types in filtered result
**Examples:**
```sql
-- Filter map entries where key is greater than value
SELECT map_filter(map(1, 0, 2, 2, 3, -1), (k, v) -> k > v);
-- Returns: {1:0, 3:-1}
-- Filter string map by key length
SELECT map_filter(map('a', 'apple', 'bb', 'banana'), (k, v) -> length(k) > 1);
-- Returns: {'bb':'banana'}
```
```scala
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(map_filter(col("data_map"), (k, v) => k > v))
// Filter a map column, keeping entries with non-null keys and positive values
df.withColumn("filtered_map",
  map_filter(col("original_map"), (key, value) => key.isNotNull && value > 0))
```
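The null-handling edge cases listed earlier can also be exercised from the DataFrame API. The sketch below is illustrative only (the column names and local-session setup are assumptions, not part of the Spark documentation):
```scala
// Illustrative only: exercises the edge cases described above.
import org.apache.spark.sql.functions._
import spark.implicits._

val maps = Seq(
  (1, Map(1 -> 0, 2 -> 2, 3 -> -1)),      // regular map
  (2, Map.empty[Int, Int]),                // empty map
  (3, null.asInstanceOf[Map[Int, Int]])    // null map
).toDF("id", "m")

maps
  .select(col("id"), map_filter(col("m"), (k, v) => k > v).as("filtered"))
  .show(truncate = false)
// Expected, per the edge cases above:
//   id 1 -> {1 -> 0, 3 -> -1}  (only entries where the predicate is true)
//   id 2 -> {}                 (empty map stays an empty map of the same type)
//   id 3 -> null               (null input map yields null)
```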
### Implementation Approach
See the [Comet guide on adding new
expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html)
for detailed instructions.
1. **Scala Serde**: Add an expression handler in `spark/src/main/scala/org/apache/comet/serde/` (a rough stand-in sketch follows this list)
2. **Register**: Add to appropriate map in `QueryPlanSerde.scala`
3. **Protobuf**: Add message type in `native/proto/src/proto/expr.proto` if
needed
4. **Rust**: Implement in `native/spark-expr/src/` (check if DataFusion has
built-in support first)
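Purely to illustrate the shape of step 1 above, here is a self-contained, stand-in sketch. `ExpressionSerde` and `ProtoExpr` are placeholders invented for this sketch; Comet's real serde trait, method signatures, and generated protobuf types are described in the contributor guide linked above:
```scala
// Stand-in sketch only: ExpressionSerde and ProtoExpr are hypothetical
// placeholders, not Comet's actual API.
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, LambdaFunction, MapFilter}

final case class ProtoExpr(kind: String, children: Seq[ProtoExpr] = Nil)

trait ExpressionSerde[T <: Expression] {
  def convert(expr: T, inputs: Seq[Attribute]): Option[ProtoExpr]
}

object MapFilterSerde extends ExpressionSerde[MapFilter] {
  override def convert(expr: MapFilter, inputs: Seq[Attribute]): Option[ProtoExpr] = {
    expr.function match {
      case LambdaFunction(_, _, _) =>
        // A real implementation would serialize the map child (expr.argument),
        // the lambda body, and its NamedLambdaVariable arguments (key, value)
        // so the native side can bind them for each map entry. Returning None
        // keeps the Spark fallback until the native kernel exists.
        None
      case _ => None
    }
  }
}
```
Because `map_filter` is a higher-order function, the serde work includes carrying the lambda variables across to the native side, which is presumably a large part of why the difficulty below is marked Large.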
## Additional context
**Difficulty:** Large
**Spark Expression Class:**
`org.apache.spark.sql.catalyst.expressions.MapFilter`
**Related:**
- `map_zip_with` - applies a function to corresponding pairs from two maps
- `transform_keys` - transforms map keys using a lambda function
- `transform_values` - transforms map values using a lambda function
- `filter` - filters array elements using a predicate function
---
*This issue was auto-generated from Spark reference documentation.*