adriangb commented on issue #22487:
URL: https://github.com/apache/datafusion/issues/22487#issuecomment-4528893952

   Investigated this on the #22239 branch (freshly built `datafusion-cli`, 
traced the physical optimizer with `EXPLAIN VERBOSE`). The short version: **the 
logical plan is already optimal — the pessimization is introduced at the 
physical level by `EnforceSorting`, not by anything in the logical layer.** 
That changes the recommended fix.
   
   ## The logical plan already sorts on the flat base column
   
   For `SELECT named_struct('a',a,'b',b) AS s FROM ordered ORDER BY s['a']`, 
the optimized **logical** plan is:
   
   ```
   Projection: s                                                                
  (trivial recovery)
     Sort: __datafusion_extracted_1 ASC NULLS LAST                              
  <- sorts on a FLAT column
       Projection: named_struct('a',a,'b',b) AS s, ordered.a AS 
__datafusion_extracted_1
         TableScan: ordered projection=[a, b]
   ```
   
   The simplifier folded the extracted `get_field(named_struct(...),'a')` down 
to `ordered.a`, so the sort key is the flat column `a` (aliased 
`__datafusion_extracted_1`). The `initial_physical_plan` preserves this and 
sorts on `__datafusion_extracted_1@1` — effectively the "ideal" from the 
description.
   
   ## `EnforceSorting` rewrites the flat sort key into `get_field(s, 'a')`
   
   Tracing the physical passes:
   
   | Pass | Sort key |
   |------|----------|
   | `initial_physical_plan` … `after OutputRequirements` | 
`__datafusion_extracted_1@1` (flat `a`) |
   | **`after EnforceSorting`** | **`get_field(s@0, a)`** |
   | `after ProjectionPushdown` | `get_field(s@0, a)`; flat column dropped from 
the scan |
   
   `EnforceSorting` reconstructs the `SortExec` and picks `get_field(s@0,'a')` 
as the equivalence-class representative over the flat 
`__datafusion_extracted_1`. The equivalence `a@0 ≡ get_field(s@0,'a')` is the 
one #21218 ("Propagate orderings through struct-producing projections") 
installs via `ScalarUDFImpl::struct_field_mapping` + `ProjectionMapping`. 
`ProjectionPushdown` then drops the now-unused flat column, giving the final 
plan.
   
   The simplifier rule didn't *directly* cause this — it collapsed the logical 
plan into a single clean projection (`[named_struct AS s, a AS 
__extracted_1]`), and that shape is exactly what lets #21218's struct-field 
equivalence fire at the physical level. (Pre-rule, CSE split the struct into a 
separate lower projection, so `EnforceSorting` never found a `get_field(s,'a')` 
representative to swap in, and the final plan kept the flat sort key.) Two 
features interacting.
   
   ## Performance: it's a wash, not a regression
   
   - **Pre-#22239 plan:** scan pre-computes `get_field(named_struct(...),'a')` 
into a flat column, carries it through the sort as extra payload, recovery 
projection drops it → one `get_field` pass **plus** an extra column reordered 
through the sort.
   - **Current plan:** scan builds `s`; `SortExec` computes `get_field(s,'a')` 
only as the sort key (not carried as output) → one `get_field` pass, **no** 
extra payload, one fewer projection node.
   - **"Ideal" plan:** sort on raw `a`, build `s` afterward → saves the single 
`get_field` pass.
   
   `get_field` over a `StructArray` returns the child array (an `Arc` clone, 
plus a null-buffer merge if the struct is nullable) — O(n) with a tiny 
constant, negligible against the O(n·log n) sort. So the current plan does the 
**same `get_field` work the pre-rule plan already did**, with a simpler shape 
and slightly less data carried through the sort; the "ideal" plan would save 
only that one cheap pass. Confirms the original note: a tradeoff, not a clear 
regression.
   
   ## Implication for the fix
   
   This rules out the "logical rule to look through the projection boundary" 
option — the logical plan already sorts on the flat base column. The only lever 
that matters is physical:
   
   - **Cheapest targeted fix:** bias `EnforceSorting`'s 
equivalence-representative selection to prefer a plain `Column` over a compound 
`get_field(...)` when both are in the class. Generic machinery, so it needs 
care, but it'd likely help other struct-projection-over-sort cases too.
   - **The literal "ideal" plan** (struct built *above* the sort) additionally 
needs sinking the `named_struct` projection past the `SortExec` — a 
projection-reorder optimization, more involved, for negligible gain.
   
   **Recommendation:** low priority / arguably won't-fix. The current plan is 
no slower than before #22239 and is structurally simpler. If pursued, the 
target is `EnforceSorting`'s representative choice, not a new logical rule.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to