alamb opened a new issue, #4968:
URL: https://github.com/apache/arrow-datafusion/issues/4968
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
`ProjectionExec` can either perform computations such as (`col1` + `col2`) or
simply reorder / rename columns.
The first use case benefits from repartitioning (as the calculation can then
be done on multiple cores).
The second use case (reordering / renaming) does not benefit from
repartitioning, as it is simply a bookkeeping arrangement.
Basically we have a plan like
```text
ProjectionExec: expr=[f@0 as f]
  DeduplicateExec: [tag@1 ASC,time@2 ASC]
    SortPreservingMergeExec: [tag@1 ASC,time@2 ASC]
      UnionExec
```
That is then optimized by
https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_optimizer/repartition.rs
to repartition before the projection
```text
ProjectionExec: expr=[f@0 as f]
  RepartitionExec: partitioning=RoundRobinBatch(4) <-- This repartition node is likely worthless
    DeduplicateExec: [tag@1 ASC,time@2 ASC]
      SortPreservingMergeExec: [tag@1 ASC,time@2 ASC]
        UnionExec
```
**Describe the solution you'd like**
I think `ProjectionExec` should only "benefit from partitioning" when its
expressions actually contain calculations (i.e. are not just columns /
aliases).
This would likely mean defining `benefits_from_input_partitioning`
https://github.com/apache/arrow-datafusion/blob/906896b7c59ff14d71b3056ec4349274cf6662af/datafusion/core/src/physical_plan/mod.rs#L176-L183
For `impl ExecutionPlan for ProjectionExec`:
https://github.com/apache/arrow-datafusion/blob/906896b7c59ff14d71b3056ec4349274cf6662af/datafusion/core/src/physical_plan/projection.rs#L151
so that it returns true only if at least one expression is something other
than a plain column reference / alias.
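
The check described above could look roughly like the following. This is a minimal, self-contained sketch rather than the actual DataFusion code: `PhysicalExpr`, `Column`, and `BinaryExpr` here are simplified, hypothetical stand-ins for the real types, and the free function only illustrates the logic that an `ExecutionPlan::benefits_from_input_partitioning` override on `ProjectionExec` might use. It assumes each projection expression is stored as an `(expression, output name)` pair, so a rename is just a `Column` paired with a different name.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's `PhysicalExpr` trait.
trait PhysicalExpr {
    fn as_any(&self) -> &dyn Any;
}

/// Stand-in for a plain column reference such as `f@0`.
#[allow(dead_code)]
struct Column {
    name: String,
    index: usize,
}

/// Stand-in for a computed expression such as `col1 + col2`.
#[allow(dead_code)]
struct BinaryExpr {
    left: Arc<dyn PhysicalExpr>,
    op: char,
    right: Arc<dyn PhysicalExpr>,
}

impl PhysicalExpr for Column {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl PhysicalExpr for BinaryExpr {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Returns true only if at least one projected expression is something
/// other than a bare column reference, i.e. there is real work to spread
/// across partitions. Pure reorder / rename projections return false.
fn benefits_from_input_partitioning(exprs: &[(Arc<dyn PhysicalExpr>, String)]) -> bool {
    exprs
        .iter()
        .any(|(expr, _output_name)| expr.as_any().downcast_ref::<Column>().is_none())
}
```

With this logic, a projection such as `f@0 as f` reports no benefit, so the optimizer would have no reason to insert a `RepartitionExec` below it, while a projection containing `col1 + col2` still reports a benefit.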
**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features
you've considered.
**Additional context**
I think this is a good first issue, as the code and the desired behavior are
fairly straightforward; I suspect this would largely be an exercise in
updating tests.