wiedld opened a new issue, #14691:
URL: https://github.com/apache/datafusion/issues/14691
### Describe the bug
During the EnforceSorting optimizer run, a valid plan can become invalid
because a necessary `CoalescePartitionsExec` is removed. The result is a
planning-time failure in the SanityChecker with the error `does not satisfy
distribution requirements: HashPartitioned[[a@0]]). Child-0 output
partitioning: UnknownPartitioning(2)`.
We start with a valid input plan:
```
"SortExec: expr=[a@0 ASC], preserve_partitioning=[false]",
"  AggregateExec: mode=SinglePartitioned, gby=[a@0 as a1], aggr=[]",
"    CoalescePartitionsExec",
"      ProjectionExec: expr=[a@0 as a, b@1 as value]",
"        UnionExec",
"          DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], file_type=parquet",
"          DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], file_type=parquet"
```
The coalesce is then removed, which makes the plan invalid:
```
"SortPreservingMergeExec: [a@0 ASC]",
"  SortExec: expr=[a@0 ASC], preserve_partitioning=[true]",
"    AggregateExec: mode=SinglePartitioned, gby=[a@0 as a1], aggr=[]",
"      ProjectionExec: expr=[a@0 as a, b@1 as value]",
"        UnionExec",
"          DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], file_type=parquet",
"          DataSourceExec: file_groups={1 group: [[x]]}, projection=[a, b, c, d, e], file_type=parquet",
```
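To illustrate why dropping the coalesce breaks the plan, here is a minimal, self-contained sketch of a distribution check. It is a toy model, not DataFusion's API: the `Partitioning`, `Distribution`, and `Node` types, the `sanity_check` function, and the builder helpers are all hypothetical, and the plan is reduced to the aggregate, the optional coalesce, and the union. The sketch assumes that a single-partition child satisfies a hash-partitioning requirement, while a two-partition child with unknown partitioning does not.

```rust
// Toy model only: every type and function here is hypothetical and is
// NOT DataFusion's API. It only mimics the shape of the distribution check.

/// How a node's output rows are spread across partitions.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    Unknown(usize),           // n partitions, no known partitioning key
    Hash(Vec<String>, usize), // n partitions, hashed on the given columns
}

impl Partitioning {
    fn count(&self) -> usize {
        match self {
            Partitioning::Unknown(n) | Partitioning::Hash(_, n) => *n,
        }
    }
}

/// What a parent node requires of each child's output.
#[derive(Debug, Clone, PartialEq)]
enum Distribution {
    Unspecified,
    HashPartitioned(Vec<String>),
}

/// Assumption of this sketch: one partition satisfies any hash requirement.
fn satisfies(output: &Partitioning, required: &Distribution) -> bool {
    match required {
        Distribution::Unspecified => true,
        Distribution::HashPartitioned(keys) => {
            output.count() == 1
                || matches!(output, Partitioning::Hash(k, _) if k == keys)
        }
    }
}

struct Node {
    name: &'static str,
    output: Partitioning,
    required_input: Distribution,
    children: Vec<Node>,
}

/// Walk the tree and report the first child that violates its parent's
/// distribution requirement.
fn sanity_check(node: &Node) -> Result<(), String> {
    for (i, child) in node.children.iter().enumerate() {
        if !satisfies(&child.output, &node.required_input) {
            return Err(format!(
                "{} does not satisfy distribution requirements: {:?}. \
                 Child-{} output partitioning: {:?}",
                node.name, node.required_input, i, child.output
            ));
        }
        sanity_check(child)?;
    }
    Ok(())
}

/// A union of two single-partition sources yields two partitions.
fn union_of_two_sources() -> Node {
    Node {
        name: "UnionExec",
        output: Partitioning::Unknown(2),
        required_input: Distribution::Unspecified,
        children: vec![],
    }
}

/// Merges all input partitions into one.
fn coalesce(child: Node) -> Node {
    Node {
        name: "CoalescePartitionsExec",
        output: Partitioning::Unknown(1),
        required_input: Distribution::Unspecified,
        children: vec![child],
    }
}

/// Stand-in for the SinglePartitioned aggregate grouping on column `a`.
fn aggregate_over(child: Node) -> Node {
    Node {
        name: "AggregateExec",
        output: Partitioning::Unknown(child.output.count()),
        required_input: Distribution::HashPartitioned(vec!["a".to_string()]),
        children: vec![child],
    }
}
```

In this model, `sanity_check(&aggregate_over(coalesce(union_of_two_sources())))` returns `Ok(())`, while `sanity_check(&aggregate_over(union_of_two_sources()))` returns an `Err`, mirroring the two plans above.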
### To Reproduce
A test case demonstrates this:
https://github.com/apache/datafusion/pull/14637/commits/670eff35bce04efdc163ce7823437691aa9f29f6
### Expected behavior
EnforceSorting should not take a valid plan and make it invalid, which then
causes the planning sanity check to fail.
### Additional context
We already have a proposed solution:
https://github.com/apache/datafusion/pull/14637
While debugging, I did a minor refactor of `parallelize_sorts` and its helper
`remove_bottleneck_in_subplan`. The reason for the refactor ([also summarized
here](https://github.com/apache/datafusion/pull/14637#discussion_r1957023902))
was that I noticed a pattern of several necessary nodes being removed and
then added back later. I elected to simplify the code (IMO) by tightening up
how we build the `PlanWithCorrespondingCoalescePartitions`, so that we
correctly identify which nodes should be removed in the first place, instead
of removing them and then adding them back. The refactor is isolated in this commit:
https://github.com/apache/datafusion/pull/14637/commits/0661ed7e8934e7f2a711416b85cbafde2a7b99e2
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]