mustafasrepo commented on PR #10095:
URL: https://github.com/apache/datafusion/pull/10095#issuecomment-2074203264
> > Maybe we can insert RepartitionExec on top UnionExecs if their output
partition number > config.target_partitions. By this way, we can guarantee this
violation wouldn't propagate to other operators.
>
> Probably better solution would be planning union inputs execution
according to total available partitions -- e.g
>
> ```sql
> select l_linenumber as f
> from lineitem
> union all
> select l_orderkey as f
> from lineitem
> ```
>
> with target_partitions = 4, could plan 2 threads for each ParquetExec
(ideally we could also use byte/row statistics and plan according to them --
not only 2-2, but probably 1-3 if there is significant data skew across
inputs/files).
>
> Currently, with target_partitions = 4, it's planned as 4 threads per
ParquetExec, and 8 output partitions for UNION.
>
> And on top of it, when target_partitions is less then number of UNION
inputs (e.g. UNION has 10 inputs, target_partitions = 4, and we need at least 1
thread for each input) there could be RepartitionExec.
That might work. However, this approach cannot solve all cases I guess. For
the following query
```sql
select * from table
union all
select * from table
union all
select * from table
```sql
when `target_partitions=2`, there is no way to change input partitions to
generate 2 partitions after `union`. We would generate at least `3`
partitions(assuming each input generates single partition). Hence, your
suggestion may not solve all cases.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]