adragomir commented on issue #19550:
URL: https://github.com/apache/datafusion/issues/19550#issuecomment-3846075841

   @adriangb you are completely correct about the issues you identified.
   Those are exactly what we had to solve in our internal fork, which
implements (basically, and probably with bugs) the proposal in #11745:
https://github.com/hstack/datafusion/blob/main/datafusion/optimizer/src/optimize_projections_deep.rs
 - maybe it helps with some ideas.
   Some notes on the algorithm:
   1. It basically walks the logical plan top to bottom. At each level it
keeps the plan together with a `DeepColumnMap` (the equivalent of
`ColumnPath`): an index into the schema plus the deep projections for that
schema.
   2. At each level, we pass down what we computed to be needed.
   3. What happens next depends on the plan type. For a subquery alias, for
example, we try to "resolve" the references coming from the upper layer; for a
join, we do the same thing.
   4. If it's an intermediary projection that carries no extra information
(like `Projection: col, other_col` in your example), we don't update anything:
we already looked at the references to that column higher up in the plan, so
the ones here are not needed and we ignore them.
   5. Of course, we need to recompute the schemas etc. at each level.
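
As a rough illustration of steps 1-4 above, here is a minimal, self-contained sketch of the top-down pass. It is not the code from the fork: the `Plan` enum, the `push_down` function, and the shape of `DeepColumnMap` used here (column index to a set of nested field paths, where an empty path means "the whole column") are all assumptions made for the example, and real plans would also need joins, aliases and schema recomputation.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the fork's `DeepColumnMap`:
/// column index in a node's schema -> nested field paths needed from it.
/// An empty path (`vec![]`) means the whole column is needed.
type DeepColumnMap = BTreeMap<usize, BTreeSet<Vec<String>>>;

/// Toy logical plan with only the two node kinds needed for the sketch.
enum Plan {
    /// Leaf: a table scan exposing named columns.
    Scan { columns: Vec<String> },
    /// Pass-through projection: output column i is input column `input_indices[i]`.
    Projection { input_indices: Vec<usize>, input: Box<Plan> },
}

/// Walk the plan top to bottom, translating what the parent needs into
/// what the child needs, and return the deep paths required per scan column.
fn push_down(plan: &Plan, required: &DeepColumnMap) -> BTreeMap<String, BTreeSet<Vec<String>>> {
    match plan {
        Plan::Scan { columns } => required
            .iter()
            .filter_map(|(idx, paths)| {
                columns.get(*idx).map(|name| (name.clone(), paths.clone()))
            })
            .collect(),
        Plan::Projection { input_indices, input } => {
            // The projection adds no information of its own: just re-index
            // the parent's requirements onto the child's schema.
            let mut child_required = DeepColumnMap::new();
            for (out_idx, paths) in required {
                if let Some(in_idx) = input_indices.get(*out_idx) {
                    child_required
                        .entry(*in_idx)
                        .or_default()
                        .extend(paths.iter().cloned());
                }
            }
            push_down(input, &child_required)
        }
    }
}
```

Calling `push_down` on a projection over a scan, with a requirement of `["subfield1", "subfield2"]` on the projection's first output column, yields that same path attached to the corresponding scan column by name.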
   
   Some notes on the implementation:
   1. We got this to work as a separate, "degenerate" optimization step. We
couldn't get it to run more than once, and we couldn't put it in the
`optimize_projections` step, which I think is where it should live; that was
too complicated.
   2. When we push it down to the parquet leaves, we never actually change the
"structure" of the schema; from what I can tell, this is the direction. That is,
if we only need `struct['subfield1']['subfield2']`, the TableScan outputs a
`struct{subfield1: struct{subfield2: ...}}`. We don't "extract" the leaf,
because we couldn't figure out how to change all the schemas.
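
To make the "keep the structure" point concrete, here is a small sketch of pruning a nested type down to one path while preserving the enclosing structs. The `Ty` enum and the `prune` function are hypothetical, written only for this example; they are not Arrow's `DataType` or anything from the fork.

```rust
/// Hypothetical miniature type system: a leaf type and a struct of named fields.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int,
    Struct(Vec<(String, Ty)>),
}

/// Keep only the field reached by `path`, but leave every enclosing struct
/// in place, so `s.subfield1.subfield2` becomes
/// `struct{subfield1: struct{subfield2: ...}}` rather than the bare leaf.
/// Returns `None` if the path does not exist in the type.
fn prune(ty: &Ty, path: &[&str]) -> Option<Ty> {
    match (ty, path.split_first()) {
        // Path exhausted: everything below this point is kept as-is.
        (_, None) => Some(ty.clone()),
        (Ty::Struct(fields), Some((head, rest))) => {
            let (name, child) = fields.iter().find(|(n, _)| n.as_str() == *head)?;
            Some(Ty::Struct(vec![(name.clone(), prune(child, rest)?)]))
        }
        // Path continues but we hit a non-struct type.
        (_, Some(_)) => None,
    }
}
```

With this shape, expressions higher in the plan that say `struct['subfield1']['subfield2']` still resolve against the pruned schema unchanged, which is the reason for not extracting the leaf in this sketch.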


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

