berkaysynnada opened a new issue, #9111:
URL: https://github.com/apache/arrow-datafusion/issues/9111

   ### Is your feature request related to a problem or challenge?
   
   In the current version of `ProjectionPushdown`, there are some algorithmic limitations, and the rule is not easy to extend. It is not possible to push this optimization to its theoretical limits within a pure "pushing down" strategy; we need a different approach. In addition, any custom plan should be easy to integrate with the optimization.
   
   ### Describe the solution you'd like
   
   The new rule aims to achieve the most effective use of projections in plans. It ensures that query plans are free of unnecessary projections and that no unused columns are propagated between plans. The rule is designed to enhance query performance by:
   1. Preventing the transfer of unused columns from the leaves to the root. For example, for `SELECT a FROM t WHERE b > 0`, only columns `a` and `b` need to leave the scan; every other column of `t` can be pruned there.
   2. Ensuring projections are used only when they contribute to narrowing the schema, or when they are necessary for expression evaluation or aliasing.
   
   The optimization is conducted in two phases:
   Top-down Phase:
   ---------------
   - Traverses the plan from the root to the leaves. If the node is:
     1. A projection node, it may (steps (a) and (b) are sketched in the example after this list):
       a) Merge it with its input projection, if merging is beneficial.
       b) Remove the projection, if it is redundant.
       c) Narrow the projection, if possible.
       d) Nest the projection into the source.
       e) Do nothing, otherwise.
     2. A non-projection node, then either:
       a) Its schema needs pruning: insert the necessary projections into its children.
       b) All of its fields are required: do nothing.
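
   For concreteness, here is a minimal, self-contained sketch of steps 1(a) and 1(b). It uses a toy model in which a projection is just a list of input column indices; it ignores expressions and aliases (which the real rule must handle) and is not the actual DataFusion `ProjectionExec` API:
   ```
   /// Toy model: a projection is the list of input column indices it selects.
   type Projection = Vec<usize>;

   /// Step (a): merge a projection with its input projection by composing
   /// the two index lists into a single, equivalent projection.
   fn merge(outer: &Projection, inner: &Projection) -> Projection {
       outer.iter().map(|&i| inner[i]).collect()
   }

   /// Step (b): a projection is redundant if it selects every input column
   /// in order, i.e. it is the identity over its input schema.
   fn is_redundant(proj: &Projection, input_width: usize) -> bool {
       proj.len() == input_width && proj.iter().enumerate().all(|(i, &c)| i == c)
   }

   fn main() {
       // [0, 2] on top of [1, 2, 3] collapses into a single projection [1, 3].
       assert_eq!(merge(&vec![0, 2], &vec![1, 2, 3]), vec![1, 3]);
       // [0, 1, 2] over a 3-column input is the identity, hence removable.
       assert!(is_redundant(&vec![0, 1, 2], 3));
   }
   ```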
   
   Bottom-up Phase:
   ----------------
   This pass is required because modifying a plan node can change the column indices used by the nodes above it. When such a change occurs, we store the old and new indices of the affected columns in the node's state. We then proceed from the leaves to the root, updating the column indices in each plan by consulting these mapping records. The top-down phase may also leave some projections unnecessary: when a projection inspects its input's schema mapping, it can remove itself and hand a new schema mapping to the node that was formerly its input.
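
   To make the index rewriting concrete, here is a hedged sketch with a simplified stand-in for column references (the real rule tracks DataFusion `Column`s, which also carry field names), showing how a parent's references could be updated from a child's old-to-new index mapping:
   ```
   use std::collections::HashMap;

   /// Simplified stand-in for a column reference by index.
   #[derive(Debug, PartialEq)]
   struct ColRef(usize);

   /// Bottom-up update: rewrite each column reference in a parent node using
   /// the child's mapping of old index -> new index. References absent from
   /// the mapping were unaffected by the child's change and stay as they are.
   fn rewrite_columns(refs: &mut [ColRef], schema_mapping: &HashMap<usize, usize>) {
       for r in refs.iter_mut() {
           if let Some(&new_idx) = schema_mapping.get(&r.0) {
               r.0 = new_idx;
           }
       }
   }

   fn main() {
       // A child projection was narrowed, shifting column 3 to 1 and 4 to 2.
       let mapping = HashMap::from([(3, 1), (4, 2)]);
       let mut parent_refs = vec![ColRef(0), ColRef(3), ColRef(4)];
       rewrite_columns(&mut parent_refs, &mapping);
       assert_eq!(parent_refs, vec![ColRef(0), ColRef(1), ColRef(2)]);
   }
   ```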
   
   The designed node structure is:
   ```
   struct ProjectionOptimizer {
       pub plan: Arc<dyn ExecutionPlan>,
    /// The node above expects to be able to reach these columns.
    pub required_columns: HashSet<Column>,
    /// The nodes above will be updated according to these matches. The key
    /// is the initial column, and the value is its updated version.
    pub schema_mapping: HashMap<Column, Column>,
       pub children_nodes: Vec<ProjectionOptimizer>,
   }
   ```
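
   As an illustration of how `required_columns` might propagate in the top-down phase, here is a hypothetical sketch with a simplified `Column` stand-in: a filter node, for instance, would require everything its parent requires plus the columns referenced by its own predicate, and anything outside that set can be pruned below it:
   ```
   use std::collections::HashSet;

   /// Simplified stand-in for DataFusion's `Column` (name + index).
   #[derive(Clone, PartialEq, Eq, Hash, Debug)]
   struct Column {
       name: String,
       index: usize,
   }

   /// Hypothetical helper: the columns a filter node needs are everything its
   /// parent requires plus the columns referenced by its own predicate.
   fn required_for_filter(
       parent_required: &HashSet<Column>,
       predicate_columns: &[Column],
   ) -> HashSet<Column> {
       let mut required = parent_required.clone();
       required.extend(predicate_columns.iter().cloned());
       required
   }

   fn main() {
       let col = |name: &str, index| Column { name: name.into(), index };
       // The parent only needs "a"; the filter's predicate references "b".
       let parent = HashSet::from([col("a", 0)]);
       let required = required_for_filter(&parent, &[col("b", 1)]);
       // Both "a" and "b" must survive below the filter; any other column can
       // be pruned by inserting a projection under it.
       assert!(required.contains(&col("a", 0)) && required.contains(&col("b", 1)));
   }
   ```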
   To summarize: with two state variables on each plan node (one for transferring the required columns, and one for communicating changes in column indices), and with two passes (effectively a single pass, since the bottom-up pass happens implicitly while attaching the transformed children to the node itself), we will have a future-proof rule.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   I am currently working on this issue. I plan to open a PR for suggestions, especially on how to update the ExecutionPlan API to get rid of the if-else structure over all plan types. It will likely be ready next week. It is not expected to significantly alter our current plans, but it will be a solid step towards optimizing the outcomes of existing and future optimizations.

