taiyang-li opened a new pull request, #11484:
URL: https://github.com/apache/incubator-gluten/pull/11484
### Motivation
In production we observed frequent patterns like
`array_contains(map_keys(mp), target)`. This incurs extra runtime cost: it
materializes the key array and scans it for each row. Spark already provides a
semantically equivalent primitive `map_contains_key(mp, target)` which can
avoid building the intermediate array and can be executed more efficiently in
backends.
This PR adds an optimizer rule to automatically rewrite
`array_contains(map_keys(m), k)` to `map_contains_key(m, k)` for Gluten
backends (ClickHouse and Velox), keeping Spark semantics (type coercion and
null handling) unchanged.
### Implementation
#### Logical rule
* Introduce a LogicalPlan-level rule `ArrayContainsMapKeysRewriteRule` under
`org.apache.gluten.extension.columnar` (in `gluten-substrait`).
* The rule pattern-matches on expressions of the form:
* `ArrayContains(MapKeys(m), k)`
and rewrites them to:
* `MapContainsKey(m, k)`
* The rule operates via `transformExpressionsUp` and only applies on
resolved plans, relying on Catalyst's built-in `ArrayContains`, `MapKeys` and
`MapContainsKey` expressions for type coercion and nullability, so the
resulting expression keeps the same data type and nullability as the original
one.
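The core of the rewrite can be illustrated with a self-contained sketch. The mini-AST below is a toy stand-in for Catalyst (the real `ArrayContains`/`MapKeys`/`MapContainsKey` classes additionally carry type-coercion and nullability logic); it only demonstrates the bottom-up matching that `transformExpressionsUp` performs:

```scala
// Toy expression tree standing in for Catalyst's Expression hierarchy.
// These case classes are illustrative, not the real Spark classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class MapKeys(m: Expr) extends Expr
case class ArrayContains(arr: Expr, k: Expr) extends Expr
case class MapContainsKey(m: Expr, k: Expr) extends Expr

// Bottom-up transform, analogous to Catalyst's transformExpressionsUp:
// children are rewritten first, then the rule is tried on the parent node.
def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
  val withNewChildren = e match {
    case MapKeys(m)           => MapKeys(transformUp(m)(rule))
    case ArrayContains(a, k)  => ArrayContains(transformUp(a)(rule), transformUp(k)(rule))
    case MapContainsKey(m, k) => MapContainsKey(transformUp(m)(rule), transformUp(k)(rule))
    case leaf                 => leaf
  }
  rule.applyOrElse(withNewChildren, identity[Expr])
}

// The single pattern the PR's rule matches.
val rewrite: PartialFunction[Expr, Expr] = {
  case ArrayContains(MapKeys(m), k) => MapContainsKey(m, k)
}

val before = ArrayContains(MapKeys(Attr("mp")), Attr("target"))
val after  = transformUp(before)(rewrite)
assert(after == MapContainsKey(Attr("mp"), Attr("target")))
```

Because the transform runs bottom-up, children are rewritten before the parent pattern is tried, so nested occurrences of the pattern are caught as well.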
#### Backend injections
**ClickHouse backend**
* In `CHRuleApi.injectSpark`, register the rule as an optimizer rule:
* `injector.injectOptimizerRule(_ => ArrayContainsMapKeysRewriteRule)`
* The registration style is consistent with existing logical expression
rewrites such as `EqualToRewrite`.
**Velox backend**
* In `VeloxRuleApi.injectSpark`, register the same rule as an optimizer rule:
* `injector.injectOptimizerRule(_ => ArrayContainsMapKeysRewriteRule)`
* This ensures the rewrite happens at the logical optimization stage before
offloading to the Velox backend.
### Testing
**Velox scalar function validation**
* Extend `ScalarFunctionsValidateSuite` with a new test (Spark >= 3.3) for
`Map[Int, String]`:
* Build a small table containing cases where:
* key exists (e.g. key = 1)
* key does not exist (e.g. key = 5)
* empty map
* null map
* For each row, compute three boolean columns and assert they are all `true`:
  * `array_contains(map_keys(i), 1) <=> map_contains_key(i, 1)` (existing key)
  * `array_contains(map_keys(i), 5) <=> map_contains_key(i, 5)` (non-existing key)
  * `array_contains(map_keys(i), cast(null as int)) <=> map_contains_key(i, cast(null as int))` (NULL key)
* The test is executed via `runQueryAndCompare` to guarantee parity
between vanilla Spark and Gluten+Velox, and verifies that the executed plan
contains `ProjectExecTransformer`.
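For reference, the equivalence checks above boil down to a query of roughly this shape (the table name `t` is illustrative; the map column `i` follows the PR's test description):

```sql
-- Each column should evaluate to true for every row; <=> is Spark's
-- null-safe equality, so rows where both sides are NULL also compare equal.
SELECT
  array_contains(map_keys(i), 1) <=> map_contains_key(i, 1)  AS existing_key,
  array_contains(map_keys(i), 5) <=> map_contains_key(i, 5)  AS missing_key,
  array_contains(map_keys(i), CAST(NULL AS INT))
    <=> map_contains_key(i, CAST(NULL AS INT))               AS null_key
FROM t
```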
### Build
* Ran locally (with tests skipped):
* `mvn -q -DskipTests -T 1C package`
* The build and Spotless formatting both passed; no additional compile or
style issues were observed.
### Semantics
* The rewrite is purely at the expression level and keeps Spark's semantic
behavior for:
* Type coercion between the map key type and the lookup key expression
type (e.g. `INT` vs `DOUBLE`).
* Nullability: null maps and/or null keys retain the same behavior as the
original `array_contains(map_keys(m), k)` expression.
* The rule *only* replaces `array_contains(map_keys(m), k)` with the canonical
`map_contains_key(m, k)` builtin and does not introduce any custom UDF or
backend-specific expression, so behavior remains aligned with Spark across all
supported Spark versions in Gluten.
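As a sanity check, the shared null semantics can be modeled with `Option` (a sketch based on Spark's documented `array_contains` behavior, not Spark source; Spark map keys cannot themselves be NULL, which is what makes this simplification valid):

```scala
// Toy model of the null semantics shared by array_contains(map_keys(m), k)
// and map_contains_key(m, k). None stands for SQL NULL. Hypothetical helper,
// not Spark code.
def mapContainsKey(m: Option[Map[Int, String]], k: Option[Int]): Option[Boolean] =
  (m, k) match {
    case (None, _) | (_, None) => None                 // NULL map or NULL key -> NULL
    case (Some(mp), Some(key)) => Some(mp.contains(key))
  }

assert(mapContainsKey(Some(Map(1 -> "a")), Some(1)) == Some(true))   // existing key
assert(mapContainsKey(Some(Map(1 -> "a")), Some(5)) == Some(false))  // missing key
assert(mapContainsKey(Some(Map.empty), Some(1))     == Some(false))  // empty map
assert(mapContainsKey(None, Some(1))                == None)         // NULL map
assert(mapContainsKey(Some(Map(1 -> "a")), None)    == None)         // NULL key
```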