brijrajk commented on code in PR #12151:
URL: https://github.com/apache/gluten/pull/12151#discussion_r3500310173
##########
gluten-ut/spark40/src/test/resources/backends-velox/tpcds-plan-stability/gluten-approved-plans-modified/q59.sf100/simplified.txt:
##########
@@ -23,52 +23,42 @@ VeloxColumnarToRow
InputIteratorTransformer
InputAdapter
ColumnarBroadcastExchange #2
- WholeStageCodegenTransformer (3)
- FilterExecTransformer
[d_date_sk,d_week_seq]
- Subquery #1
- VeloxColumnarToRow
-
WholeStageCodegenTransformer (2)
-
RegularHashAggregateExecTransformer [bloom_filter_agg(xxhash64(d_week_seq, 42),
335, 8990, 0, 0),bloomFilter,buf,buf]
-
InputIteratorTransformer
- InputAdapter
- ColumnarExchange
#3
-
VeloxResizeBatches
-
WholeStageCodegenTransformer (1)
-
FlushableHashAggregateExecTransformer [_pre_x] [buf,buf,buf]
+ RowToVeloxColumnar
+ WholeStageCodegen (1)
+ Filter [d_date_sk,d_week_seq]
+ Subquery #1
+ ObjectHashAggregate
[buf] [bloom_filter_agg(xxhash64(d_week_seq, 42), 335, 8990, 0,
0),bloomFilter,buf]
+ VeloxColumnarToRow
+ ColumnarExchange #3
+ VeloxResizeBatches
+
RowToVeloxColumnar
+
ObjectHashAggregate [d_week_seq] [buf,buf]
+
VeloxColumnarToRow
+
WholeStageCodegenTransformer (1)
Review Comment:
@zhztheplayer Yes, it is in the final executed plan — and it is correct.
Here's the breakdown:
**`bloom_filter_agg` is the logical attribute name, not the physical
function class.** It comes from `AggregateExpression.resultAttribute` (a `lazy
val` frozen at logical→physical conversion time from
`BloomFilterAggregate.prettyName`). It never updates after construction, which
is why both the pre-PR golden (`RegularHashAggregateExecTransformer`) and the
post-PR golden (`ObjectHashAggregateExec`) show the same `bloom_filter_agg`
attribute name.
**The physical function inside `ObjectHashAggregateExec` is
`VeloxBloomFilterAggregate`, not vanilla `BloomFilterAggregate`.**
`BloomFilterRuntimeFilterRewriteRule` rewrites it at position 10 via
`subq.withNewPlan(rewrittenPlan)`. `VeloxBloomFilterAggregate.eval()`
unconditionally calls `serialize(buffer)` → produces Velox-format (version=1)
bytes even in JVM mode. `VeloxBloomFilterMightContain` on the outer side reads
them correctly.
**Why `ObjectHashAggregateExec` instead of
`RegularHashAggregateExecTransformer`:** `PlanSubqueries` (position 2 in
`QueryExecution.preparations`) prepares the subquery plan independently, before
`ApplyColumnarRulesAndInsertTransitions` (position 10) runs on the main plan.
At position 2 the standalone subquery still has vanilla `BloomFilterAggregate`
— `HeuristicTransform` finds no Substrait mapping and leaves it as
`ObjectHashAggregateExec`. `BloomFilterRuntimeFilterRewriteRule` substitutes
`VeloxBloomFilterAggregate` into the already-prepared subquery at position 10
of the main plan, but `HeuristicTransform` does not re-process it. Hence JVM
mode, but with `VeloxBloomFilterAggregate` inside.
**Why not restore native offloading:** doing so would require substituting
`BloomFilterAggregate → VeloxBloomFilterAggregate` during subquery preparation
— which also fires for the SPARK-54336 case (`might_contain(empty_subquery,
literal)`). There, vanilla `BloomFilterAggregate` must return `null` for empty
input; `VeloxBloomFilterAggregate` has no cardinality guard and returns
non-null bytes, changing `null → false`. The pre-PR code had this regression
silently; this PR trades native offloading of the runtime-filter aggregate for
correct SPARK-54336 semantics.
**Runtime verification (added in latest commit):**
`GlutenBloomFilterFallbackSuite` now has a test (`"GLUTEN-12013:
VeloxBloomFilterAggregate in JVM subquery produces correct Velox bytes"`) that
explicitly checks `collectWithSubqueries(executedPlan)` for
`ObjectHashAggregateExec` nodes containing `VeloxBloomFilterAggregate` and
asserts it is non-empty, then verifies the query returns 200,003 rows with no
version-mismatch exception. All 6 fallback tests + 14
`GlutenBloomFilterAggregateQuerySuite` tests (including the SPARK-54336
null-on-empty-input case) pass locally.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]