adriangb opened a new issue, #22487:
URL: https://github.com/apache/datafusion/issues/22487

   ## Background
   
   Follow-up from review of #22239, which added a simplifier rule that resolves 
`get_field` over an inline struct constructor at plan time 
(`get_field(named_struct('a', a, 'b', b), 'a') => a`).
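
To make the fold concrete, here is a minimal toy model of the rewrite in Python. It is only a sketch: `Column`, `NamedStruct`, `GetField`, and `simplify` are hypothetical names invented for illustration and are not DataFusion APIs. The real rule operates on DataFusion's own expression types.

```python
from dataclasses import dataclass


# Hypothetical toy expression nodes (NOT DataFusion types).
@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class NamedStruct:
    fields: tuple  # tuple of (field_name, expr) pairs


@dataclass(frozen=True)
class GetField:
    base: object
    field: str


def simplify(expr):
    """Fold get_field over an inline struct constructor; leave other shapes alone."""
    if isinstance(expr, GetField):
        base = simplify(expr.base)
        if isinstance(base, NamedStruct):
            for name, value in base.fields:
                if name == expr.field:
                    # get_field(named_struct(..., f, v, ...), f) => v
                    return value
        return GetField(base, expr.field)
    if isinstance(expr, NamedStruct):
        return NamedStruct(tuple((n, simplify(v)) for n, v in expr.fields))
    return expr


# get_field(named_struct('a', a, 'b', b), 'a'): the constructor is inline.
inline = GetField(NamedStruct((("a", Column("a")), ("b", Column("b")))), "a")

# get_field(s, 'a'): `s` is only a column reference, so nothing can be folded.
through_column = GetField(Column("s"), "a")

folded = simplify(inline)
unfolded = simplify(through_column)
```

In this model `folded` is the bare column `a`, while `unfolded` is returned unchanged because the struct constructor is hidden behind a column reference.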
   
   That rule changed the physical plan for queries that sort by a struct field. 
Consider:
   
   ```sql
   EXPLAIN SELECT named_struct('a', a, 'b', b) AS s FROM ordered ORDER BY s['a'];
   ```
   
   where `ordered` is declared `WITH ORDER (a + b)` (so it is *not* ordered by 
`a` alone, and a `SortExec` is genuinely required).
   
   **Before the simplifier rule** — `ExtractLeafExpressions` pulled the 
`get_field` sort key into a scan-level projection (`__datafusion_extracted_1`), 
the sort ran on that flat column, and a recovery projection hid the extra 
column:
   
   ```
   01)ProjectionExec: expr=[s@0 as s]
   02)--SortExec: expr=[__datafusion_extracted_1@1 ASC NULLS LAST], preserve_partitioning=[false]
   03)----DataSourceExec: projection=[named_struct(a, a@0, b, b@1) as s, get_field(named_struct(a, a@0, b, b@1), a) as __datafusion_extracted_1], ...
   ```
   
   **After the simplifier rule** — the inline `get_field` is folded away, the 
extract/recover dance collapses, and what remains is the `ORDER BY` referencing 
the *output struct column*. Here `s@0` is a column reference (not an inline 
`named_struct`), so the simplifier cannot fold it:
   
   ```
   01)SortExec: expr=[get_field(s@0, a) ASC NULLS LAST], preserve_partitioning=[false]
   02)--DataSourceExec: projection=[named_struct(a, a@0, b, b@1) as s], ...
   ```
   
   The plan is simpler (no extra column, no recovery projection), but the sort 
key is now computed by extracting a field from the materialized struct array 
rather than by reading a flat column.
   
   ## Opportunity
   
   The ideal plan sorts on the base column *before* constructing the struct, 
avoiding struct-field extraction in the sort key entirely:
   
   ```
   01)ProjectionExec: expr=[named_struct(a, a@0, b, b@1) as s]
   02)--SortExec: expr=[a@0 ASC NULLS LAST]
   03)----DataSourceExec: projection=[a, b]
   ```
   
   To get there, an optimizer pass would need to look *through* the child 
projection that defines `s = named_struct('a', a, 'b', b)` and rewrite 
`get_field(s@0, 'a')` back to the underlying `a`, which is essentially the 
inverse of CSE across a projection boundary. This is more than a local sort-key 
normalization, because it also requires moving the struct-construction 
projection above the sort.
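
The shape of such a pass can be sketched with a toy plan model in Python. Everything here is hypothetical and for illustration only: the node classes and the `resolve_through` and `push_sort_below_projection` helpers are not DataFusion APIs. The sketch also assumes the projection is a deterministic, row-wise expression, so sorting before or after it yields the same row order. Note that in the plans above the projection lives inside `DataSourceExec`; the toy model uses a separate `Projection` node for clarity.

```python
from dataclasses import dataclass


# Hypothetical toy expression and plan nodes (NOT DataFusion types).
@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class NamedStruct:
    fields: tuple  # tuple of (field_name, expr) pairs


@dataclass(frozen=True)
class GetField:
    base: object
    field: str


@dataclass(frozen=True)
class Scan:
    columns: tuple


@dataclass(frozen=True)
class Projection:
    exprs: tuple  # tuple of (output_name, expr) pairs
    input: object


@dataclass(frozen=True)
class Sort:
    key: object
    input: object


def resolve_through(key, projection):
    """Rewrite get_field(col, f) in terms of the projection's input, else None."""
    if isinstance(key, GetField) and isinstance(key.base, Column):
        definition = dict(projection.exprs).get(key.base.name)
        if isinstance(definition, NamedStruct):
            for name, value in definition.fields:
                # Conservative: only resolve to a plain input column.
                if name == key.field and isinstance(value, Column):
                    return value
    return None


def push_sort_below_projection(plan):
    """Sort(get_field(s, f), Projection(s = named_struct(...))) =>
    Projection(s = named_struct(...), Sort(base_column))."""
    if isinstance(plan, Sort) and isinstance(plan.input, Projection):
        proj = plan.input
        base_key = resolve_through(plan.key, proj)
        if base_key is not None:
            return Projection(proj.exprs, Sort(base_key, proj.input))
    return plan  # leave anything else untouched


struct_s = NamedStruct((("a", Column("a")), ("b", Column("b"))))
before = Sort(
    GetField(Column("s"), "a"),
    Projection((("s", struct_s),), Scan(("a", "b"))),
)
after = push_sort_below_projection(before)
```

Here `after` has the same shape as the ideal plan: the struct-construction projection sits on top, and the sort below it uses the plain column `a`. Whether this belongs in a logical rule or a physical pass is exactly the open question raised in the notes below.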
   
   ## Notes / open questions
   
   - This is a **tradeoff, not a clear regression**: DataFusion materializes 
sort keys into an array once before sorting (not per comparison), so the cost 
is one struct-field extraction pass vs. a flat column read, partly offset by 
the dropped extra column + recovery projection. Worth a benchmark before 
investing.
   - Should this be a logical rule (rewrite `get_field(col, f)` when `col`'s 
defining projection is a struct constructor) or a physical sort-key 
normalization? The projection boundary is the crux either way.
   - See the `# Sort elimination through named_struct projections` section in 
`datafusion/sqllogictest/test_files/order.slt` for the current behavior.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

