adriangb commented on issue #22487:
URL: https://github.com/apache/datafusion/issues/22487#issuecomment-4528893952
Investigated this on the #22239 branch (freshly built `datafusion-cli`,
traced the physical optimizer with `EXPLAIN VERBOSE`). The short version: **the
logical plan is already optimal — the pessimization is introduced at the
physical level by `EnforceSorting`, not by anything in the logical layer.**
That changes the recommended fix.
## The logical plan already sorts on the flat base column
For `SELECT named_struct('a',a,'b',b) AS s FROM ordered ORDER BY s['a']`,
the optimized **logical** plan is:
```
Projection: s
(trivial recovery)
Sort: __datafusion_extracted_1 ASC NULLS LAST
<- sorts on a FLAT column
Projection: named_struct('a',a,'b',b) AS s, ordered.a AS
__datafusion_extracted_1
TableScan: ordered projection=[a, b]
```
The simplifier folded the extracted `get_field(named_struct(...),'a')` down
to `ordered.a`, so the sort key is the flat column `a` (aliased
`__datafusion_extracted_1`). The `initial_physical_plan` preserves this and
sorts on `__datafusion_extracted_1@1` — effectively the "ideal" from the
description.
## `EnforceSorting` rewrites the flat sort key into `get_field(s, 'a')`
Tracing the physical passes:
| Pass | Sort key |
|------|----------|
| `initial_physical_plan` … `after OutputRequirements` |
`__datafusion_extracted_1@1` (flat `a`) |
| **`after EnforceSorting`** | **`get_field(s@0, a)`** |
| `after ProjectionPushdown` | `get_field(s@0, a)`; flat column dropped from
the scan |
`EnforceSorting` reconstructs the `SortExec` and picks `get_field(s@0,'a')`
as the equivalence-class representative over the flat
`__datafusion_extracted_1`. The equivalence `a@0 ≡ get_field(s@0,'a')` is the
one #21218 ("Propagate orderings through struct-producing projections")
installs via `ScalarUDFImpl::struct_field_mapping` + `ProjectionMapping`.
`ProjectionPushdown` then drops the now-unused flat column, giving the final
plan.
The simplifier rule didn't *directly* cause this — it collapsed the logical
plan into a single clean projection (`[named_struct AS s, a AS
__extracted_1]`), and that shape is exactly what lets #21218's struct-field
equivalence fire at the physical level. (Pre-rule, CSE split the struct into a
separate lower projection, so `EnforceSorting` never found a `get_field(s,'a')`
representative to swap in, and the final plan kept the flat sort key.) Two
features interacting.
## Performance: it's a wash, not a regression
- **Pre-#22239 plan:** scan pre-computes `get_field(named_struct(...),'a')`
into a flat column, carries it through the sort as extra payload, recovery
projection drops it → one `get_field` pass **plus** an extra column reordered
through the sort.
- **Current plan:** scan builds `s`; `SortExec` computes `get_field(s,'a')`
only as the sort key (not carried as output) → one `get_field` pass, **no**
extra payload, one fewer projection node.
- **"Ideal" plan:** sort on raw `a`, build `s` afterward → saves the single
`get_field` pass.
`get_field` over a `StructArray` returns the child array (an `Arc` clone,
plus a null-buffer merge if the struct is nullable) — O(n) with a tiny
constant, negligible against the O(n·log n) sort. So the current plan does the
**same `get_field` work the pre-rule plan already did**, with a simpler shape
and slightly less data carried through the sort; the "ideal" plan would save
only that one cheap pass. Confirms the original note: a tradeoff, not a clear
regression.
## Implication for the fix
This rules out the "logical rule to look through the projection boundary"
option — the logical plan already sorts on the flat base column. The only lever
that matters is physical:
- **Cheapest targeted fix:** bias `EnforceSorting`'s
equivalence-representative selection to prefer a plain `Column` over a compound
`get_field(...)` when both are in the class. Generic machinery, so it needs
care, but it'd likely help other struct-projection-over-sort cases too.
- **The literal "ideal" plan** (struct built *above* the sort) additionally
needs sinking the `named_struct` projection past the `SortExec` — a
projection-reorder optimization, more involved, for negligible gain.
**Recommendation:** low priority / arguably won't-fix. The current plan is
no slower than before #22239 and is structurally simpler. If pursued, the
target is `EnforceSorting`'s representative choice, not a new logical rule.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]