andygrove opened a new issue, #3165:
URL: https://github.com/apache/datafusion-comet/issues/3165
## What is the problem the feature request solves?
> **Note:** This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.
Comet does not currently support the Spark `map_filter` function, causing
queries using this function to fall back to Spark's JVM execution instead of
running natively on DataFusion.
MapFilter is a higher-order function that filters map entries based on a
lambda predicate function. It applies the provided lambda function to each
key-value pair in the input map and returns a new map containing only the
entries where the predicate evaluates to true.
Supporting this expression would allow more Spark workloads to benefit from
Comet's native acceleration.
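As a concrete illustration of the fallback, a minimal spark-shell sketch might look like the following (the sample data, column name, and plugin configuration hint are assumptions for illustration, not part of this issue):
```scala
// Hypothetical repro in spark-shell, assuming the Comet plugin is enabled for
// the session (e.g. --conf spark.plugins=org.apache.spark.CometPlugin).
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(Map(1 -> 0, 2 -> 2, 3 -> -1)).toDF("m")
  .select(map_filter(col("m"), (k, v) => k > v).as("filtered"))

// With map_filter unsupported, the projection evaluating it stays on Spark's
// JVM path rather than running in a native Comet operator.
df.explain("extended")
df.show(truncate = false)
```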
## Describe the potential solution
### Spark Specification
**Syntax:**
```sql
map_filter(map_expression, lambda_function)
```
```scala
// DataFrame API usage
map_filter(col("map_column"), (k, v) => k > v)
```
**Arguments:**
| Argument | Type | Description |
|----------|------|-------------|
| map_expression | MapType | The input map to be filtered |
| lambda_function | Lambda | A function that takes two parameters (key, value) and returns a boolean |
**Return Type:** Returns a MapType with the same key and value types as the
input map, containing only the entries that satisfy the predicate condition.
**Supported Data Types:**
- Input map can have keys and values of any supported Spark SQL data type
- The lambda function must return a boolean expression
- Key and value types are preserved in the output map
**Edge Cases:**
- If input map is null, returns null
- If lambda function returns null for a key-value pair, that pair is
excluded (treated as false)
- Empty input map returns empty map of the same type
- Lambda function exceptions will cause the entire expression to fail
- Maintains original key and value data types in filtered result
**Examples:**
```sql
-- Filter map entries where key is greater than value
SELECT map_filter(map(1, 0, 2, 2, 3, -1), (k, v) -> k > v);
-- Returns: {1:0, 3:-1}
-- Filter string map by key length
SELECT map_filter(map('a', 'apple', 'bb', 'banana'), (k, v) -> length(k) > 1);
-- Returns: {'bb':'banana'}
```
```scala
// DataFrame API usage
import org.apache.spark.sql.functions._
df.select(map_filter(col("data_map"), (k, v) => k > v))
// Filter a map column, keeping entries with non-null keys and positive values
df.withColumn("filtered_map",
  map_filter(col("original_map"), (key, value) => key.isNotNull && value > 0))
```
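The null-handling edge cases listed earlier can also be exercised from the DataFrame API. The sketch below is illustrative only (the column names and local-session setup are assumptions, not part of the Spark documentation):
```scala
// Illustrative only: exercises the edge cases described above.
import org.apache.spark.sql.functions._
import spark.implicits._

val maps = Seq(
  (1, Map(1 -> 0, 2 -> 2, 3 -> -1)),      // regular map
  (2, Map.empty[Int, Int]),                // empty map
  (3, null.asInstanceOf[Map[Int, Int]])    // null map
).toDF("id", "m")

maps
  .select(col("id"), map_filter(col("m"), (k, v) => k > v).as("filtered"))
  .show(truncate = false)
// Expected, per the edge cases above:
//   id 1 -> {1 -> 0, 3 -> -1}  (only entries where the predicate is true)
//   id 2 -> {}                 (empty map stays an empty map of the same type)
//   id 3 -> null               (null input map yields null)
```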
### Implementation Approach
See the [Comet guide on adding new
expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html)
for detailed instructions.
1. **Scala Serde**: Add an expression handler in `spark/src/main/scala/org/apache/comet/serde/` (a rough stand-in sketch follows this list)
2. **Register**: Add to appropriate map in `QueryPlanSerde.scala`
3. **Protobuf**: Add message type in `native/proto/src/proto/expr.proto` if
needed
4. **Rust**: Implement in `native/spark-expr/src/` (check if DataFusion has
built-in support first)
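Purely to illustrate the shape of step 1 above, here is a self-contained, stand-in sketch. `ExpressionSerde` and `ProtoExpr` are placeholders invented for this sketch; Comet's real serde trait, method signatures, and generated protobuf types are described in the contributor guide linked above:
```scala
// Stand-in sketch only: ExpressionSerde and ProtoExpr are hypothetical
// placeholders, not Comet's actual API.
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, LambdaFunction, MapFilter}

final case class ProtoExpr(kind: String, children: Seq[ProtoExpr] = Nil)

trait ExpressionSerde[T <: Expression] {
  def convert(expr: T, inputs: Seq[Attribute]): Option[ProtoExpr]
}

object MapFilterSerde extends ExpressionSerde[MapFilter] {
  override def convert(expr: MapFilter, inputs: Seq[Attribute]): Option[ProtoExpr] = {
    expr.function match {
      case LambdaFunction(_, _, _) =>
        // A real implementation would serialize the map child (expr.argument),
        // the lambda body, and its NamedLambdaVariable arguments (key, value)
        // so the native side can bind them for each map entry. Returning None
        // keeps the Spark fallback until the native kernel exists.
        None
      case _ => None
    }
  }
}
```
Because `map_filter` is a higher-order function, the serde work includes carrying the lambda variables across to the native side, which is presumably a large part of why the difficulty below is marked Large.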
## Additional context
**Difficulty:** Large
**Spark Expression Class:**
`org.apache.spark.sql.catalyst.expressions.MapFilter`
**Related:**
- `map_zip_with` - applies a function to corresponding pairs from two maps
- `transform_keys` - transforms map keys using a lambda function
- `transform_values` - transforms map values using a lambda function
- `filter` - filters array elements using a predicate function
---
*This issue was auto-generated from Spark reference documentation.*