adriangb opened a new issue, #22487:
URL: https://github.com/apache/datafusion/issues/22487
## Background
Follow-up from review of #22239, which added a simplifier rule that resolves
`get_field` over an inline struct constructor at plan time
(`get_field(named_struct('a', a, 'b', b), 'a') => a`).
That rule changed the physical plan for queries that sort by a struct field.
Consider:
```sql
EXPLAIN
SELECT named_struct('a', a, 'b', b) AS s
FROM ordered
ORDER BY s['a'];
```
where `ordered` is declared `WITH ORDER (a + b)`, so it is *not* ordered by
`a` and a `SortExec` is genuinely required.
**Before the simplifier rule** — `ExtractLeafExpressions` pulled the
`get_field` sort key into a scan-level projection (`__datafusion_extracted_1`),
the sort ran on that flat column, and a recovery projection hid the extra
column:
```
01)ProjectionExec: expr=[s@0 as s]
02)--SortExec: expr=[__datafusion_extracted_1@1 ASC NULLS LAST], preserve_partitioning=[false]
03)----DataSourceExec: projection=[named_struct(a, a@0, b, b@1) as s, get_field(named_struct(a, a@0, b, b@1), a) as __datafusion_extracted_1], ...
```
**After the simplifier rule** — the inline `get_field` is folded away, the
extract/recover dance collapses, and what remains is the `ORDER BY` referencing
the *output struct column*. Here `s@0` is a column reference (not an inline
`named_struct`), so the simplifier cannot fold it:
```
01)SortExec: expr=[get_field(s@0, a) ASC NULLS LAST], preserve_partitioning=[false]
02)--DataSourceExec: projection=[named_struct(a, a@0, b, b@1) as s], ...
```
The plan is simpler (no extra column, no recovery projection), but the sort
key is now a per-array struct-field extraction over the materialized struct
rather than a flat column read.
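
The difference between the two plans comes down to whether the base of `get_field` is an inline constructor or a column reference. The following is a minimal sketch of that distinction, assuming a toy `Expr` enum that stands in for DataFusion's real expression type; every name in it (`Expr`, `simplify`, the helper constructors) is hypothetical and not DataFusion's API.

```rust
/// Toy expression tree for illustration only; NOT DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Reference to a column by name.
    Column(String),
    /// named_struct('f1', e1, 'f2', e2, ...)
    NamedStruct(Vec<(String, Expr)>),
    /// get_field(base, 'field')
    GetField(Box<Expr>, String),
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

fn named_struct(fields: &[(&str, Expr)]) -> Expr {
    Expr::NamedStruct(
        fields.iter().map(|(n, e)| (n.to_string(), e.clone())).collect(),
    )
}

fn get_field(base: Expr, field: &str) -> Expr {
    Expr::GetField(Box::new(base), field.to_string())
}

/// Fold `get_field(named_struct(..), f)` into the matching field expression.
/// Any other shape, including `get_field` over a column, is left unchanged.
fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::GetField(base, field) => match simplify(*base) {
            Expr::NamedStruct(fields) => {
                let found = fields
                    .iter()
                    .find(|(n, _)| *n == field)
                    .map(|(_, e)| e.clone());
                match found {
                    Some(e) => e,
                    // Unknown field name: keep the expression as written.
                    None => Expr::GetField(Box::new(Expr::NamedStruct(fields)), field),
                }
            }
            other => Expr::GetField(Box::new(other), field),
        },
        Expr::NamedStruct(fields) => Expr::NamedStruct(
            fields.into_iter().map(|(n, e)| (n, simplify(e))).collect(),
        ),
        other => other,
    }
}
```

In this toy model, `simplify(get_field(named_struct(&[("a", col("a")), ("b", col("b"))]), "a"))` folds to `col("a")`, while `simplify(get_field(col("s"), "a"))` comes back unchanged, because the definition of `s` lives in a different plan node and is invisible to a purely local rewrite.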
## Opportunity
The ideal plan sorts on the base column *before* constructing the struct,
avoiding struct-field extraction in the sort key entirely:
```
01)ProjectionExec: expr=[named_struct(a, a@0, b, b@1) as s]
02)--SortExec: expr=[a@0 ASC NULLS LAST]
03)----DataSourceExec: projection=[a, b]
```
To get there, an optimizer pass would need to look *through* the child
projection that defines `s = named_struct('a', a, 'b', b)` and rewrite
`get_field(s@0, 'a')` back to the underlying `a`, which is essentially the
inverse of common subexpression elimination (CSE) across a projection
boundary. This is more than a local sort-key normalization, because the
struct-construction projection also has to move above the sort.
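
A rough sketch of such a pass follows, written against a toy plan representation rather than DataFusion's `LogicalPlan` or `ExecutionPlan`. All types and function names (`Plan`, `resolve`, `push_sort_below_projection`) are hypothetical. The sketch assumes the struct is built in a separate projection node, and it ignores sort options, volatile expressions, and other consumers of the projection, all of which a real rule would have to handle.

```rust
use std::collections::HashMap;

/// Toy expression and plan types for illustration; NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    NamedStruct(Vec<(String, Expr)>),
    GetField(Box<Expr>, String),
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(Vec<String>),
    /// Each output is an (alias, expression) pair over the input's columns.
    Projection { exprs: Vec<(String, Expr)>, input: Box<Plan> },
    Sort { keys: Vec<Expr>, input: Box<Plan> },
}

/// Resolve `get_field(col, f)` to the field's defining expression when `col`
/// is defined by the projection as a struct constructor. Returns `None` for
/// every other shape.
fn resolve(key: &Expr, defs: &HashMap<&str, &Expr>) -> Option<Expr> {
    match key {
        Expr::GetField(base, field) => match base.as_ref() {
            Expr::Column(name) => match defs.get(name.as_str())? {
                Expr::NamedStruct(fields) => fields
                    .iter()
                    .find(|(n, _)| n == field)
                    .map(|(_, e)| e.clone()),
                _ => None,
            },
            _ => None,
        },
        _ => None,
    }
}

/// Rewrite `Sort(Projection(child))` into `Projection(Sort(child))` when
/// every sort key resolves through the projection; otherwise rebuild the
/// original plan unchanged.
fn push_sort_below_projection(plan: Plan) -> Plan {
    match plan {
        Plan::Sort { keys, input } => match *input {
            Plan::Projection { exprs, input: child } => {
                let resolved: Option<Vec<Expr>> = {
                    let defs: HashMap<&str, &Expr> =
                        exprs.iter().map(|(a, e)| (a.as_str(), e)).collect();
                    keys.iter().map(|k| resolve(k, &defs)).collect()
                };
                match resolved {
                    Some(new_keys) => Plan::Projection {
                        exprs,
                        input: Box::new(Plan::Sort { keys: new_keys, input: child }),
                    },
                    None => Plan::Sort {
                        keys,
                        input: Box::new(Plan::Projection { exprs, input: child }),
                    },
                }
            }
            other => Plan::Sort { keys, input: Box::new(other) },
        },
        other => other,
    }
}
```

The all-or-nothing check on the sort keys is deliberate: the sort can only move below the projection if every key is expressible in terms of the projection's input, which is the projection-boundary problem described above.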
## Notes / open questions
- This is a **tradeoff, not a clear regression**: DataFusion materializes
sort keys into an array once before sorting (not per comparison), so the cost
is one struct-field extraction pass vs. a flat column read, partly offset by
the dropped extra column + recovery projection. Worth a benchmark before
investing.
- Should this be a logical rule (rewrite `get_field(col, f)` when `col`'s
defining projection is a struct constructor) or a physical sort-key
normalization? The projection boundary is the crux either way.
- See the `# Sort elimination through named_struct projections` section in
`datafusion/sqllogictest/test_files/order.slt` for the current behavior.
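
The cost model in the first note (one extraction pass up front, then comparisons on a flat key array) can be pictured with a toy example. This is only an illustration of the idea, not DataFusion's sort implementation; the `S` row type and both functions are hypothetical.

```rust
/// One row of the materialized struct column `s` (toy stand-in).
#[derive(Debug, Clone, PartialEq)]
struct S {
    a: i64,
    b: i64,
}

/// Sort key taken from the struct: extract field `a` into a flat array once,
/// then sort row indices by that array. Comparisons never touch the struct.
fn sort_indices_by_extracted_key(rows: &[S]) -> Vec<usize> {
    let keys: Vec<i64> = rows.iter().map(|s| s.a).collect(); // single extraction pass
    let mut idx: Vec<usize> = (0..rows.len()).collect();
    idx.sort_by_key(|&i| keys[i]); // stable sort on the flat keys
    idx
}

/// Sort key already available as a flat column: no extraction pass needed.
fn sort_indices_by_flat_column(a: &[i64]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..a.len()).collect();
    idx.sort_by_key(|&i| a[i]);
    idx
}
```

Both functions perform the same comparisons; the only extra work in the first is the one pass that builds `keys`, which is the quantity a benchmark would need to weigh against the dropped column and recovery projection.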