berkaysynnada opened a new issue, #9111:
URL: https://github.com/apache/arrow-datafusion/issues/9111
### Is your feature request related to a problem or challenge?
In the current version of `ProjectionPushdown`, there are some algorithmic
limitations, and the rule is not easy to extend. Solving this optimization
problem to its theoretical limits is not possible within a "pushing down"
strategy; we need a different approach. In addition, any custom plan should be
easily integrable with the optimization.
### Describe the solution you'd like
The new rule aims to make the most effective use of projections in plans.
It ensures that query plans are free from unnecessary projections and that
no unused columns are propagated unnecessarily between plans. The rule is
designed to enhance query performance by:
1. Preventing the transfer of unused columns from leaves to root.
2. Ensuring projections are used only when they narrow the schema, or when
they are necessary for evaluation or aliasing.
The optimization is conducted in two phases:
Top-down Phase:
---------------
- Traverses the plan from root to leaves. If the node is:
  1. A Projection node, the rule may:
     a) Merge it with its input projection if the merge is beneficial.
     b) Remove it if it is redundant.
     c) Narrow it if possible.
     d) Embed it into the source.
     e) Do nothing, otherwise.
  2. A non-Projection node:
     a) If the schema needs pruning, insert the necessary projections above
        the children.
     b) If all fields are required, do nothing.
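As a rough illustration of cases (a) and (b) above, the sketch below models a projection as a plain list of input column indices (not the real `ProjectionExec` or expression types), so it only demonstrates the merge and redundancy logic, not the actual DataFusion API:

```rust
/// Compose an outer projection with the projection feeding it:
/// merged[i] = inner[outer[i]].
fn merge_projections(outer: &[usize], inner: &[usize]) -> Vec<usize> {
    outer.iter().map(|&i| inner[i]).collect()
}

/// A projection is redundant if it keeps every input column in order.
fn is_redundant(proj: &[usize], input_width: usize) -> bool {
    proj.len() == input_width && proj.iter().enumerate().all(|(i, &c)| i == c)
}

fn main() {
    // inner keeps columns [2, 0, 3] of its input; outer keeps [1, 0] of inner.
    // Merging the two yields a single projection keeping [0, 2].
    let merged = merge_projections(&[1, 0], &[2, 0, 3]);
    assert_eq!(merged, vec![0, 2]);

    // An identity projection over a 3-column input can be removed.
    assert!(is_redundant(&[0, 1, 2], 3));
    assert!(!is_redundant(&[0, 2], 3));
}
```

In the real rule, "beneficial" and "redundant" are decided over full expression trees with aliases; this toy version only shows why composing two index lists is always safe.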
Bottom-up Phase:
----------------
This pass is required because modifying a plan node can change the column
indices used by its parent nodes. When such a change occurs, we store the old
and new indices of the columns in the node's state. We then proceed from the
leaves to the root, updating the column indices in the plans by referencing
these mapping records. After the top-down phase, some projections may also
become unnecessary: when a projection inspects its input's schema mapping, it
can remove itself and assign a new schema mapping to the node that was
formerly its input.
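The index fix-up can be sketched as follows, with columns simplified to `usize` indices instead of `Column` objects (the real rule maps `Column`s through the node's recorded state):

```rust
use std::collections::HashMap;

/// Rewrite a parent's column references through the child's old -> new
/// index mapping. References to columns that did not move are unchanged.
fn remap_columns(refs: &[usize], mapping: &HashMap<usize, usize>) -> Vec<usize> {
    refs.iter()
        .map(|c| *mapping.get(c).unwrap_or(c))
        .collect()
}

fn main() {
    // The child dropped its first two columns, so old index 2 became 0
    // and old index 4 became 1.
    let mapping = HashMap::from([(2, 0), (4, 1)]);
    let parent_refs = vec![2, 4, 0];
    assert_eq!(remap_columns(&parent_refs, &mapping), vec![0, 1, 0]);
}
```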
The designed node structure is:
```
struct ProjectionOptimizer {
    pub plan: Arc<dyn ExecutionPlan>,
    /// The node above expects that it can reach these columns.
    pub required_columns: HashSet<Column>,
    /// The nodes above will be updated according to these matches. The key
    /// is the initial column, and the value is its updated version.
    pub schema_mapping: HashMap<Column, Column>,
    pub children_nodes: Vec<ProjectionOptimizer>,
}
```
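A minimal, hypothetical sketch of how a node might populate its schema mapping when its schema is pruned, with columns simplified to `usize` indices and the wrapped plan reduced to its column count (none of these names come from DataFusion):

```rust
use std::collections::{HashMap, HashSet};

/// Simplified mirror of the node state above.
struct Node {
    width: usize,                          // stand-in for the wrapped plan
    required_columns: HashSet<usize>,      // demanded by the node above
    schema_mapping: HashMap<usize, usize>, // old index -> new index
}

impl Node {
    /// If the parent does not need every column, prune the schema and
    /// record how the surviving columns moved.
    fn prune(&mut self) {
        if self.required_columns.len() == self.width {
            return; // all fields are required: do nothing
        }
        let mut kept: Vec<usize> = self.required_columns.iter().copied().collect();
        kept.sort_unstable();
        self.schema_mapping = kept
            .iter()
            .enumerate()
            .map(|(new, &old)| (old, new))
            .collect();
        self.width = kept.len();
    }
}

fn main() {
    let mut node = Node {
        width: 4,
        required_columns: HashSet::from([1, 3]),
        schema_mapping: HashMap::new(),
    };
    node.prune();
    // Only columns 1 and 3 survive; they become columns 0 and 1.
    assert_eq!(node.width, 2);
    assert_eq!(node.schema_mapping[&1], 0);
    assert_eq!(node.schema_mapping[&3], 1);
}
```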
To summarize: with two state variables for each plan node (one for
transferring the required columns downward, and one for notifying parents of
changes in column indices), and with two passes (effectively a single pass,
since the bottom-up pass happens implicitly while the transformed children
are attached to their parent), we will have a future-proof rule.
### Describe alternatives you've considered
_No response_
### Additional context
I am currently working on this issue and plan to open a PR for suggestions,
especially on how to update the ExecutionPlan API to get rid of the per-plan
if-else structure. It will likely be ready next week. It is not expected to
significantly alter our current plans, but it will be a solid step towards
optimizing potential outcomes of other existing optimizations and future ones.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]