taiyang-li opened a new pull request, #11484:
URL: https://github.com/apache/incubator-gluten/pull/11484
### Motivation
In production we observed frequent patterns like
`array_contains(map_keys(mp), target)`. This incurs extra runtime cost: it
materializes the key array and scans it for each row. Spark already provides a
semantically equivalent primitive `map_contains_key(mp, target)` which can
avoid building the intermediate array and can be executed more efficiently in
backends.
This PR adds an optimizer rule to automatically rewrite
`array_contains(map_keys(m), k)` to `map_contains_key(m, k)` for Gluten
backends (ClickHouse and Velox), keeping Spark semantics (type coercion and
null handling) unchanged.
### Implementation
#### Logical rule
* Introduce a LogicalPlan-level rule `ArrayContainsMapKeysRewriteRule` under
`org.apache.gluten.extension.columnar` (in `gluten-substrait`).
* The rule pattern-matches on expressions of the form:
* `ArrayContains(MapKeys(m), k)`
and rewrites them to:
* `MapContainsKey(m, k)`
* The rule operates via `transformExpressionsUp` and only applies on
resolved plans, relying on Catalyst's built-in `ArrayContains`, `MapKeys` and
`MapContainsKey` expressions for type coercion and nullability, so the
resulting expression keeps the same data type and nullability as the original
one.
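The core of the rewrite can be illustrated with a self-contained sketch. The mini-AST below is a toy stand-in for Catalyst (the real `ArrayContains`/`MapKeys`/`MapContainsKey` classes additionally carry type-coercion and nullability logic); it only demonstrates the bottom-up matching that `transformExpressionsUp` performs:

```scala
// Toy expression tree standing in for Catalyst's Expression hierarchy.
// These case classes are illustrative, not the real Spark classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class MapKeys(m: Expr) extends Expr
case class ArrayContains(arr: Expr, k: Expr) extends Expr
case class MapContainsKey(m: Expr, k: Expr) extends Expr

// Bottom-up transform, analogous to Catalyst's transformExpressionsUp:
// children are rewritten first, then the rule is tried on the parent node.
def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val withNewChildren = e match {
    case MapKeys(m)           => MapKeys(transformUp(m)(rule))
    case ArrayContains(a, k)  => ArrayContains(transformUp(a)(rule), transformUp(k)(rule))
    case MapContainsKey(m, k) => MapContainsKey(transformUp(m)(rule), transformUp(k)(rule))
    case leaf                 => leaf
  }
  rule.applyOrElse(withNewChildren, identity[Expr])
}

// The single pattern the PR's rule matches.
val rewrite: PartialFunction[Expr, Expr] = {
  case ArrayContains(MapKeys(m), k) => MapContainsKey(m, k)
}

val before = ArrayContains(MapKeys(Attr("mp")), Attr("target"))
val after  = transformUp(before)(rewrite)
assert(after == MapContainsKey(Attr("mp"), Attr("target")))
```

Because the transform runs bottom-up, children are rewritten before the parent pattern is tried, so nested occurrences of the pattern are caught as well.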
#### Backend injections
**ClickHouse backend**
* In `CHRuleApi.injectSpark`, register the rule as an optimizer rule:
* `injector.injectOptimizerRule(_ => ArrayContainsMapKeysRewriteRule)`
* The registration style is consistent with existing logical expression
rewrites such as `EqualToRewrite`.
**Velox backend**
* In `VeloxRuleApi.injectSpark`, register the same rule as an optimizer rule:
* `injector.injectOptimizerRule(_ => ArrayContainsMapKeysRewriteRule)`
* This ensures the rewrite happens at the logical optimization stage before
offloading to the Velox backend.
### Testing
**Velox scalar function validation**
* Extend `ScalarFunctionsValidateSuite` with a new test (Spark >= 3.3) for
`Map[Int, String]`:
* Build a small table containing cases where:
* key exists (e.g. key = 1)
* key does not exist (e.g. key = 5)
* empty map
* null map
* For each row, compute three boolean columns and assert they are all `true`:
  * `array_contains(map_keys(i), 1) <=> map_contains_key(i, 1)` (existing key)
  * `array_contains(map_keys(i), 5) <=> map_contains_key(i, 5)` (non-existing key)
  * `array_contains(map_keys(i), cast(null as int)) <=> map_contains_key(i, cast(null as int))` (NULL key)
* The test is executed via `runQueryAndCompare` to guarantee parity
between vanilla Spark and Gluten+Velox, and verifies that the executed plan
contains `ProjectExecTransformer`.
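For reference, the equivalence checks above boil down to a query of roughly this shape (the table name `t` is illustrative; the map column `i` follows the PR's test description):

```sql
-- Each column should evaluate to true for every row; <=> is Spark's
-- null-safe equality, so rows where both sides are NULL also compare equal.
SELECT
  array_contains(map_keys(i), 1) <=> map_contains_key(i, 1)  AS existing_key,
  array_contains(map_keys(i), 5) <=> map_contains_key(i, 5)  AS missing_key,
  array_contains(map_keys(i), CAST(NULL AS INT))
    <=> map_contains_key(i, CAST(NULL AS INT))               AS null_key
FROM t
```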
### Build
* Ran locally (with tests skipped):
* `mvn -q -DskipTests -T 1C package`
* The build and Spotless formatting both passed; no additional compile or
style issues were observed.
### Semantics
* The rewrite is purely at the expression level and keeps Spark's semantic
behavior for:
* Type coercion between the map key type and the lookup key expression
type (e.g. `INT` vs `DOUBLE`).
* Nullability: null maps and/or null keys retain the same behavior as the
original `array_contains(map_keys(m), k)` expression.
* The rule *only* replaces `array_contains(map_keys(m), k)` with the canonical
`map_contains_key(m, k)` builtin and does not introduce any custom UDF or
backend-specific expression, so behavior remains aligned with Spark across all
supported Spark versions in Gluten.
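As a sanity check, the shared null semantics can be modeled with `Option` (a sketch based on Spark's documented `array_contains` behavior, not Spark source; Spark map keys cannot themselves be NULL, which is what makes this simplification valid):

```scala
// Toy model of the null semantics shared by array_contains(map_keys(m), k)
// and map_contains_key(m, k). None stands for SQL NULL. Hypothetical helper,
// not Spark code.
def mapContainsKey(m: Option[Map[Int, String]], k: Option[Int]): Option[Boolean] =
  (m, k) match {
    case (None, _) | (_, None) => None                 // NULL map or NULL key -> NULL
    case (Some(mp), Some(key)) => Some(mp.contains(key))
  }

assert(mapContainsKey(Some(Map(1 -> "a")), Some(1)) == Some(true))   // existing key
assert(mapContainsKey(Some(Map(1 -> "a")), Some(5)) == Some(false))  // missing key
assert(mapContainsKey(Some(Map.empty), Some(1))     == Some(false))  // empty map
assert(mapContainsKey(None, Some(1))                == None)         // NULL map
assert(mapContainsKey(Some(Map(1 -> "a")), None)    == None)         // NULL key
```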