adragomir commented on issue #19550: URL: https://github.com/apache/datafusion/issues/19550#issuecomment-3846075841
@adriangb you are completely correct about the issues you identified. When we did our internal fork (which basically implements the proposal in #11745, probably with bugs), those are the problems we had to solve: https://github.com/hstack/datafusion/blob/main/datafusion/optimizer/src/optimize_projections_deep.rs - maybe it helps with some ideas.

Some notes on the algorithm:

1. The algorithm basically goes top to bottom over the logical plan. At each level it keeps the plan together with a `DeepColumnMap` (the equivalent of `ColumnPath`): an index into the schema plus the deep projections for that schema.
2. At each level, we pass down what we computed to be needed.
3. What happens next depends on the plan type. For a subquery alias, for example, we try to "resolve" the references coming from the upper layer; for a join, we do the same thing.
4. If it's an intermediate projection that carries no extra information (like `Projection: col, other_col` in your example), we don't update the map. We look at the references to that column higher up in the plan instead; the bare references in the projection are not needed, so we ignore them.
5. Of course, we need to recompute the schemas etc. at each level.

Some notes on the limitations:

1. We only got this to work as a separate, "degenerate" optimization step. We couldn't get it to run more than once, and we couldn't put it in the `optimize_projections` step, which is where I think it should live, because that was too complicated.
2. When we push the projection down to the Parquet leaves, we never actually change the "structure" of the schema. From what I can tell, this is the direction being taken. That is, if we only need `struct['subfield1']['subfield2']`, the TableScan outputs a `struct{subfield1: struct{subfield2: ...}}`. We don't "extract" the subfield, because we couldn't figure out how to change all the schemas.
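To make the last point concrete, here is a minimal sketch of pruning a nested schema while keeping its nesting intact. It is not the code from the fork and does not use any DataFusion or Arrow types: `Field`, `prune`, `strukt` and `leaf` are hypothetical names invented for this illustration, and paths are modeled as plain lists of subfield names.

```rust
use std::collections::BTreeMap;

// Hypothetical toy schema type; a stand-in for a real Arrow schema.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Leaf(String),                    // a primitive type name, e.g. "Int64"
    Struct(BTreeMap<String, Field>), // named subfields
}

// Small constructors to keep the example readable.
fn strukt(fields: Vec<(&str, Field)>) -> Field {
    Field::Struct(fields.into_iter().map(|(n, f)| (n.to_string(), f)).collect())
}

fn leaf(ty: &str) -> Field {
    Field::Leaf(ty.to_string())
}

// Keep only the subfields reachable through at least one requested path.
// The nesting is preserved: nothing is "extracted" to the top level.
// Passing no paths at all prunes every subfield of a struct.
fn prune(field: &Field, paths: &[Vec<&str>]) -> Field {
    // An empty path means "the whole field is needed from here down".
    if paths.iter().any(|p| p.is_empty()) {
        return field.clone();
    }
    match field {
        Field::Leaf(_) => field.clone(),
        Field::Struct(children) => {
            let mut kept = BTreeMap::new();
            for (name, child) in children {
                // Paths that start with this child, with the first segment removed.
                let sub: Vec<Vec<&str>> = paths
                    .iter()
                    .filter(|p| p[0] == name.as_str())
                    .map(|p| p[1..].to_vec())
                    .collect();
                if !sub.is_empty() {
                    kept.insert(name.clone(), prune(child, &sub));
                }
            }
            Field::Struct(kept)
        }
    }
}

fn main() {
    let col = strukt(vec![
        ("subfield1", strukt(vec![("subfield2", leaf("Int64")), ("other", leaf("Utf8"))])),
        ("unused", leaf("Utf8")),
    ]);
    // Only struct['subfield1']['subfield2'] is referenced higher up in the plan.
    let pruned = prune(&col, &[vec!["subfield1", "subfield2"]]);
    println!("{:?}", pruned);
}
```

In this sketch the pruned column is still a struct containing `subfield1`, which is still a struct containing `subfield2`; only the unreferenced siblings are dropped. That mirrors the trade-off described above: expressions higher in the plan keep the same access path, at the cost of not flattening the output.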
