cloud-fan opened a new pull request, #55719:
URL: https://github.com/apache/spark/pull/55719
### What changes were proposed in this pull request?
Followup to SPARK-56482 (#55425). Two groups of changes to `UnionExec`'s
whole-stage codegen path.
**Code cleanup:**
- Hoist `metricTerm("numOutputRows")` to `doProduce` and store it on the
instance. `doConsume` runs once per child during emission, so the previous code
registered the same metric N times in `references[]` for an N-child Union; it
is now registered once.
- Drop the dead `assert` in `perChildProjections` and the duplicate
`allChildOutputDataTypesMatch` lazy val. The dataType comparison now has a
single source of truth in the `type-mismatch` branch of the gate.
- Inline the one-shot `hasAnyPartitionIndexDependentDescendant` lazy val.
- Drop the unreachable `case other` in the `UnionPartition` match and
replace the match with `asInstanceOf`. `unionedInputRDD` is built as
`new UnionRDD(...)` two lines up, and `getPartitions` only ever returns
`UnionPartition[_]`.
- Factor out an `isPlainUnion` helper, used by both the gate and `doExecute`,
so the invariant "codegen path matches `sparkContext.union` semantics" lives in
one place.
- Hoist the child-local index to a `childLocalIdx` local at helper entry.
References emitted by `RangeExec`/`SampleExec` now read a plain int instead of
re-evaluating `((int[]) refs[K])[partitionIndex]` per use.
- Drop the `try/finally` around codegen state restoration. Codegen failure
aborts the whole stage, so the restoration is unreachable.
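
To illustrate the first cleanup item, here is a minimal, self-contained sketch of why per-child registration grows `references[]`. It does not use Spark: `ToyCodegenContext`, `ToyMetric`, and `MetricHoistSketch` are hypothetical stand-ins, and `addReferenceObj` here only mimics the idea of appending to a references array, not the real `CodegenContext` API.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a codegen context; NOT Spark's CodegenContext.
final class ToyCodegenContext {
  val references: ArrayBuffer[AnyRef] = ArrayBuffer.empty

  // Appends the object and returns the expression generated code would use to read it.
  def addReferenceObj(name: String, obj: AnyRef): String = {
    references += obj
    s"references[${references.length - 1}] /* $name */"
  }
}

// Hypothetical stand-in for a SQL metric.
final class ToyMetric(val name: String)

object MetricHoistSketch {
  // Before: the consume step runs once per child and registers the metric each time.
  def registerPerChild(numChildren: Int): Int = {
    val ctx = new ToyCodegenContext
    val metric = new ToyMetric("numOutputRows")
    (0 until numChildren).foreach { _ =>
      ctx.addReferenceObj("numOutputRows", metric)
    }
    ctx.references.length
  }

  // After: the produce step registers once; every child reuses the stored term.
  def registerOnce(numChildren: Int): Int = {
    val ctx = new ToyCodegenContext
    val metric = new ToyMetric("numOutputRows")
    val term = ctx.addReferenceObj("numOutputRows", metric)
    val perChildCode = (0 until numChildren).map(i => s"$term.add(1); // child $i")
    require(perChildCode.length == numChildren)
    ctx.references.length
  }
}
```

Under these assumptions, a 3-child union ends up with three reference slots in the first variant and one in the second, while every child still emits code against the same term.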
**Gate narrowing:**
- Narrow `hasPartitionIndexDependentCodegen` to exclude `InputFileName`,
`InputFileBlockStart`, and `InputFileBlockLength`. These are `Nondeterministic`
but read from `InputFileBlockHolder` (a per-task thread-local) and do not embed
`partitionIndex`, so they are safe under fusion. Queries like `SELECT
input_file_name() FROM a UNION ALL SELECT input_file_name() FROM b` now fuse.
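
The shape of the narrowed gate can be sketched on a toy expression tree. This is not the code in the PR: the `Expr` hierarchy, `PartitionSeededRand`, `isNondeterministic`, and `readsTaskLocalFileInfo` below are hypothetical, and only the name `hasPartitionIndexDependentCodegen` and the three excluded expression names come from the description above.

```scala
// Toy expression tree; hypothetical, not Catalyst's Expression hierarchy.
sealed trait Expr { def children: Seq[Expr] = Nil }
case object InputFileName extends Expr
case object InputFileBlockStart extends Expr
case object InputFileBlockLength extends Expr
// Stands in for any nondeterministic expression whose codegen embeds partitionIndex.
case object PartitionSeededRand extends Expr
final case class Column(name: String) extends Expr
final case class Alias(child: Expr, name: String) extends Expr {
  override def children: Seq[Expr] = Seq(child)
}

object GateSketch {
  // Assumption: these four are the only nondeterministic nodes in the toy model.
  private def isNondeterministic(e: Expr): Boolean = e match {
    case InputFileName | InputFileBlockStart | InputFileBlockLength => true
    case PartitionSeededRand => true
    case _ => false
  }

  // Nondeterministic, but reads per-task file info rather than partitionIndex.
  private def readsTaskLocalFileInfo(e: Expr): Boolean = e match {
    case InputFileName | InputFileBlockStart | InputFileBlockLength => true
    case _ => false
  }

  // True if this node or any descendant should block fusion in the toy model.
  def hasPartitionIndexDependentCodegen(e: Expr): Boolean = {
    val selfDepends = isNondeterministic(e) && !readsTaskLocalFileInfo(e)
    selfDepends || e.children.exists(hasPartitionIndexDependentCodegen)
  }
}
```

In this model `Alias(InputFileName, "f")` passes the gate, while `Alias(PartitionSeededRand, "r")` is still rejected.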
### Why are the changes needed?
The cleanups remove accidental complexity in the fused code path: an N-fold
metric reference, two duplicated dataType comparisons, an unreachable defensive
guard, a per-iteration array deref, and a `try/finally` that protects against
an unreachable case. The gate narrowing turns a missed optimization (file-scan
unions) into a fused plan.
### Does this PR introduce _any_ user-facing change?
No. `spark.sql.codegen.wholeStage.union.enabled` remains off by default;
when on, the new behavior fuses additional plans (file-scan unions with
`input_file_name()`) that the previous gate over-rejected.
### How was this patch tested?
`UnionCodegenSuite`, `UnionCodegenAnsiSuite`, `UnionCodegenAqeSuite`, and
the relevant `SQLMetricsSuite` test all pass. Two tests added:
- `partitioning-aware union falls back to non-codegen` — covers a
`supportCodegenFailureReason` branch that lacked explicit coverage.
- `input_file_name child fuses (Nondeterministic but partition-index-free)`
— validates the gate narrowing.
The `columnar` fallback branch is not covered by a new test: reliably
constructing a plan where `Union.supportsColumnar` is true via the user-facing
API turned out to be brittle, since `ApplyColumnarRulesAndInsertTransitions`
aggressively rebalances columnar/row transitions.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code