crepererum commented on issue #5970:
URL:
https://github.com/apache/arrow-datafusion/issues/5970#issuecomment-1508165252
> > > I would argue that UnionExec should NEVER modify its inputs but just be a plain, simple node that forwards its inputs w/o messing up sorting (or any other property).
> >
> > I agree with this sentiment -- we already have `RepartitionExec` that concatenates batches from different streams
>
> Does it mean that the `UnionExec` will return the inputs sequentially (i.e. concatenated) or will it potentially interleave the inputs, whilst maintaining the ordering?
What it SHOULD do (IMHO), and what it already does in the majority of cases (i.e. when `partition_aware = false`):
```yaml
---
# Plan:
UnionExec:
  children:
    - SomeChild:
        output_partitions:
          - [batch111, batch112]
          - [batch121, batch122]
    - SomeChild:
        output_partitions:
          - [batch211, batch212]
          - [batch221, batch222]
---
# Equivalent pseudo-plan
UnionExec:
  output_partitions:
    - [batch111, batch112]
    - [batch121, batch122]
    - [batch211, batch212]
    - [batch221, batch222]
```
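To make the "plain forwarding" behaviour concrete, here is a rough Rust sketch of the index mapping it implies. This is NOT the actual DataFusion code: `locate_partition` is a made-up helper, and it assumes the union simply lists all partitions of the first child, then all partitions of the second child, and so on.

```rust
/// Hypothetical helper (not part of DataFusion): given the number of output
/// partitions of each child, map an output partition index of the union to
/// `(child index, partition index within that child)`.
///
/// Assumption: the union exposes child partitions back to back, in child
/// order, without touching the batches inside any partition.
fn locate_partition(
    child_partition_counts: &[usize],
    mut output_partition: usize,
) -> Option<(usize, usize)> {
    for (child, &count) in child_partition_counts.iter().enumerate() {
        if output_partition < count {
            // The requested partition belongs to this child.
            return Some((child, output_partition));
        }
        // Skip over all partitions of this child.
        output_partition -= count;
    }
    // Index is past the last partition of the last child.
    None
}
```

With two children of two partitions each, as in the plan above, `locate_partition(&[2, 2], 2)` returns `Some((1, 0))`, i.e. output partition 2 is just partition 0 of the second child, forwarded unchanged.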
However, what it actually does with `partition_aware = true`:
```yaml
---
UnionExec:
  output_partitions:
    - CombinedRecordBatchStream:
        - [batch111, batch112]
        - [batch211, batch212]
    - CombinedRecordBatchStream:
        - [batch121, batch122]
        - [batch221, batch222]
---
# May yield (if lucky):
UnionExec:
  output_partitions:
    - [batch111, batch112, batch211, batch212]
    - [batch121, batch122, batch221, batch222]
---
# May also yield (if not so lucky):
UnionExec:
  output_partitions:
    - [batch111, batch211, batch112, batch212]
    - [batch221, batch222, batch121, batch122]
```
```
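A toy model of the "not so lucky" case, as a sketch only: plain integers stand in for sorted batches, and a strict round-robin stands in for "whichever stream happens to be ready". The real stream polls asynchronously, so the actual order is not deterministic; `round_robin` and `is_non_decreasing` are names I made up for illustration.

```rust
/// Toy stand-in for combining two streams: take one item from each input in
/// turn. Each input keeps its own internal order, but the combined output
/// has no global ordering guarantee.
fn round_robin(a: &[i32], b: &[i32]) -> Vec<i32> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut ia, mut ib) = (a.iter(), b.iter());
    loop {
        match (ia.next(), ib.next()) {
            (None, None) => break,
            (x, y) => {
                // `Option<&i32>` iterates over zero or one element.
                out.extend(x);
                out.extend(y);
            }
        }
    }
    out
}

/// Returns true if the slice is sorted in non-decreasing order.
fn is_non_decreasing(values: &[i32]) -> bool {
    values.windows(2).all(|w| w[0] <= w[1])
}
```

For example, combining the sorted inputs `[1, 2, 3, 4]` and `[10, 20, 30, 40]` this way gives `[1, 10, 2, 20, 3, 30, 4, 40]`, which is no longer sorted even though both inputs were.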
The exact logic of `CombinedRecordBatchStream` can be found here:
https://github.com/apache/arrow-datafusion/blob/fcd8b899e2a62f798413c536f47078289ece9d05/datafusion/core/src/physical_plan/union.rs#L364-L408
This shuffling obviously confuses the `SortPreservingMergeExec` logic, which assumes that its inputs are sorted.
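To illustrate why a merge breaks on shuffled input, here is a minimal two-way merge sketch. It is not the `SortPreservingMergeExec` implementation, just the textbook merge step under the same assumption (each input is already sorted); `merge_assuming_sorted` is a hypothetical name.

```rust
/// Textbook two-way merge. It only ever compares the current heads of the
/// two inputs, so its output is sorted ONLY IF both inputs are sorted.
fn merge_assuming_sorted(a: &[i32], b: &[i32]) -> Vec<i32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    // One input is exhausted; append whatever is left of the other.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

Merging `[1, 2, 3, 4]` with `[2, 5]` gives the sorted `[1, 2, 2, 3, 4, 5]`, but feeding it an interleaved input such as `[1, 10, 2, 20]` together with `[5, 6]` gives `[1, 5, 6, 10, 2, 20]`: the merge never looks past the head of each input, so it cannot notice the violated ordering.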