logan-keede commented on issue #17261:
URL: https://github.com/apache/datafusion/issues/17261#issuecomment-3523066384
After some more sleuthing, I think I have narrowed down on the cause of this
problem
Consider the following query and its unoptimized Logical Plan:-
```explain analyze (SELECT c0, 1 as c1, 2 as c2, 3 as c3, 4 as c4, 5 as c5,
6 as c6, 7 as c7, 8 as c8, 9 as c9 FROM t1 ORDER BY c0)
UNION ALL
(SELECT 0 as c0, c1, 2 as c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, 7 as c7, 8
as c8, 9 as c9 FROM t1 ORDER BY c1)
UNION ALL
(SELECT 0 as c0, 1 as c1, c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, 7 as c7, 8
as c8, 9 as c9 FROM t1 ORDER BY c2)
UNION ALL
(SELECT 0 as c0, 1 as c1, 2 as c2, c3, 4 as c4, 5 as c5, 6 as c6, 7 as c7, 8
as c8, 9 as c9 FROM t1 ORDER BY c3)
UNION ALL
(SELECT 0 as c0, 1 as c1, 2 as c2, 3 as c3, c4, 5 as c5, 6 as c6, 7 as c7, 8
as c8, 9 as c9 FROM t1 ORDER BY c4)
UNION ALL
(SELECT 0 as c0, 1 as c1, 2 as c2, 3 as c3, 4 as c4, c5, 6 as c6, 7 as c7, 8
as c8, 9 as c9 FROM t1 ORDER BY c5)
UNION ALL
(SELECT 0 as c0, 1 as c1, 2 as c2, 3 as c3, 4 as c4, 5 as c5, c6, 7 as c7, 8
as c8, 9 as c9 FROM t1 ORDER BY c6)
UNION ALL
(SELECT 0 as c0, 1 as c1, 2 as c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, c7, 8
as c8, 9 as c9 FROM t1 ORDER BY c7)
UNION ALL
(SELECT 0 as c0, 1 as c1, 2 as c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, 7 as
c7, c8, 9 as c9 FROM t1 ORDER BY c8)
UNION ALL
(SELECT 0 as c0, 1 as c1, 2 as c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, 7 as
c7, 8 as c8, c9 FROM t1 ORDER BY c9)
ORDER BY c0, c1, c2, c3, c4, c5, c6, c7, c8, c9;
Unoptimized Logical Plan
Analyze
Sort: c0 ASC NULLS LAST, c1 ASC NULLS LAST, c2 ASC NULLS LAST, c3 ASC
NULLS LAST, c4 ASC NULLS LAST, c5 ASC NULLS LAST, c6 ASC NULLS LAST, c7 ASC
NULLS LAST, c8 ASC NULLS LAST, c9 ASC NULLS LAST
Union
Union
Union
Union
Union
Union
Union
Union
Union
Sort: t1.c0 ASC NULLS LAST
Projection: t1.c0, Int64(1) AS c1, Int64(2) AS c2,
Int64(3) AS c3, Int64(4) AS c4, Int64(5) AS c5, Int64(6) AS c6, Int64(7) AS c7,
Int64(8) AS c8, Int64(9) AS c9
TableScan: t1
Sort: t1.c1 ASC NULLS LAST
Projection: Int64(0) AS c0, t1.c1, Int64(2) AS c2,
Int64(3) AS c3, Int64(4) AS c4, Int64(5) AS c5, Int64(6) AS c6, Int64(7) AS c7,
Int64(8) AS c8, Int64(9) AS c9
TableScan: t1
Sort: t1.c2 ASC NULLS LAST
Projection: Int64(0) AS c0, Int64(1) AS c1, t1.c2,
Int64(3) AS c3, Int64(4) AS c4, Int64(5) AS c5, Int64(6) AS c6, Int64(7) AS c7,
Int64(8) AS c8, Int64(9) AS c9
TableScan: t1
Sort: t1.c3 ASC NULLS LAST
Projection: Int64(0) AS c0, Int64(1) AS c1, Int64(2) AS
c2, t1.c3, Int64(4) AS c4, Int64(5) AS c5, Int64(6) AS c6, Int64(7) AS c7,
Int64(8) AS c8, Int64(9) AS c9
TableScan: t1
Sort: t1.c4 ASC NULLS LAST
Projection: Int64(0) AS c0, Int64(1) AS c1, Int64(2) AS
c2, Int64(3) AS c3, t1.c4, Int64(5) AS c5, Int64(6) AS c6, Int64(7) AS c7,
Int64(8) AS c8, Int64(9) AS c9
TableScan: t1
Sort: t1.c5 ASC NULLS LAST
Projection: Int64(0) AS c0, Int64(1) AS c1, Int64(2) AS c2,
Int64(3) AS c3, Int64(4) AS c4, t1.c5, Int64(6) AS c6, Int64(7) AS c7, Int64(8)
AS c8, Int64(9) AS c9
TableScan: t1
Sort: t1.c6 ASC NULLS LAST
Projection: Int64(0) AS c0, Int64(1) AS c1, Int64(2) AS c2,
Int64(3) AS c3, Int64(4) AS c4, Int64(5) AS c5, t1.c6, Int64(7) AS c7, Int64(8)
AS c8, Int64(9) AS c9
TableScan: t1
Sort: t1.c7 ASC NULLS LAST
Projection: Int64(0) AS c0, Int64(1) AS c1, Int64(2) AS c2,
Int64(3) AS c3, Int64(4) AS c4, Int64(5) AS c5, Int64(6) AS c6, t1.c7, Int64(8)
AS c8, Int64(9) AS c9
TableScan: t1
Sort: t1.c8 ASC NULLS LAST
Projection: Int64(0) AS c0, Int64(1) AS c1, Int64(2) AS c2,
Int64(3) AS c3, Int64(4) AS c4, Int64(5) AS c5, Int64(6) AS c6, Int64(7) AS c7,
t1.c8, Int64(9) AS c9
TableScan: t1
Sort: t1.c9 ASC NULLS LAST
Projection: Int64(0) AS c0, Int64(1) AS c1, Int64(2) AS c2, Int64(3)
AS c3, Int64(4) AS c4, Int64(5) AS c5, Int64(6) AS c6, Int64(7) AS c7, Int64(8)
AS c8, t1.c9
TableScan: t1
```
Here, `0` which is a reference to the tables column number is interpreted as
an literal `Int64(0)` before being referred.
`TypeCoercionRewriter` using coerce_union:-
https://github.com/apache/datafusion/blob/cf0357478e28ba79bf70684e15c65ab8ddd062a1/datafusion/optimizer/src/analyzer/type_coercion.rs#L216-L243
Which tries to coerce union's Input into same time. It sees that it has an
Int64 literal as an argument and other UInt64 columns and tries to coerce it
using coercion Rules mentioned here:-
https://github.com/apache/datafusion/blob/cf0357478e28ba79bf70684e15c65ab8ddd062a1/datafusion/expr-common/src/type_coercion/binary.rs#L1293-L1299
Which results in a projection layer being added on top of union to coerce
its output
something like:-
```
Projection: c0, c1, CAST(c2 AS Decimal128(20, 0)) AS c2, c3, c4, c5, c6, c7,
c8, c9
Union
Projection: CAST(t1.c0 AS Decimal128(20, 0)) AS c0, CAST(c1 AS
Decimal128(20, 0)) AS c1, c2, c3, c4, c5, c6, c7, c8, c9
Sort: t1.c0 ASC NULLS LAST
```
Now, union does not have its descendant unions as direct children so,
`EliminateNestedUnion` fails to eliminate this Union.
After I remove above mentioned two coercion rules(2nd one also tries to
coerce UInt64, and adds a projection layer)
I get an optimized logical plan like the Int64 datatype scenario.
```
> ORDER BY c0, c1, c2, c3, c4, c5, c6, c7, c8, c9;
>
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> | plan_type | plan
|
>
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> | Plan with Metrics | SortPreservingMergeExec: [c0@0 ASC NULLS LAST, c1@1
ASC NULLS LAST, c2@2 ASC NULLS LAST, c3@3 ASC NULLS LAST, c4@4 ASC NULLS LAST,
c5@5 ASC NULLS LAST, c6@6 ASC NULLS LAST, c7@7 ASC NULLS LAST, c8@8 ASC NULLS
LAST, c9@9 ASC NULLS LAST], metrics=[output_rows=30, elapsed_compute=202.88µs,
output_bytes=2.3 KB] |
> | | UnionExec, metrics=[output_rows=30,
elapsed_compute=1.117171ms, output_bytes=2.3 KB]
|
> | | SortExec: expr=[c0@0 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=1.712165ms, output_bytes=0.0 B, spill_count=0,
spilled_bytes=0.0 B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[CAST(c0@0 AS Int64) as
c0, 1 as c1, 2 as c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, 7 as c7, 8 as c8, 9
as c9], metrics=[output_rows=3, elapsed_compute=79.9µs, output_bytes=280.0 B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | | SortExec: expr=[c1@1 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=1.244665ms, output_bytes=0.0 B, spill_count=0,
spilled_bytes=0.0 B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[0 as c0, CAST(c1@0 AS
Int64) as c1, 2 as c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, 7 as c7, 8 as c8, 9
as c9], metrics=[output_rows=3, elapsed_compute=86.887µs, output_bytes=280.0 B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | | SortExec: expr=[c2@2 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=1.217714ms, output_bytes=0.0 B, spill_count=0,
spilled_bytes=0.0 B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[0 as c0, 1 as c1,
CAST(c2@0 AS Int64) as c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, 7 as c7, 8 as
c8, 9 as c9], metrics=[output_rows=3, elapsed_compute=59.552µs,
output_bytes=280.0 B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | | SortExec: expr=[c3@3 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=136.622µs, output_bytes=0.0 B, spill_count=0, spilled_bytes=0.0
B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[0 as c0, 1 as c1, 2 as
c2, CAST(c3@0 AS Int64) as c3, 4 as c4, 5 as c5, 6 as c6, 7 as c7, 8 as c8, 9
as c9], metrics=[output_rows=3, elapsed_compute=90.253µs, output_bytes=280.0 B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | | SortExec: expr=[c4@4 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=1.248026ms, output_bytes=0.0 B, spill_count=0,
spilled_bytes=0.0 B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[0 as c0, 1 as c1, 2 as
c2, 3 as c3, CAST(c4@0 AS Int64) as c4, 5 as c5, 6 as c6, 7 as c7, 8 as c8, 9
as c9], metrics=[output_rows=3, elapsed_compute=100.374µs, output_bytes=280.0
B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | | SortExec: expr=[c5@5 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=1.01873ms, output_bytes=0.0 B, spill_count=0, spilled_bytes=0.0
B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[0 as c0, 1 as c1, 2 as
c2, 3 as c3, 4 as c4, CAST(c5@0 AS Int64) as c5, 6 as c6, 7 as c7, 8 as c8, 9
as c9], metrics=[output_rows=3, elapsed_compute=70.435µs, output_bytes=280.0 B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | | SortExec: expr=[c6@6 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=512.958µs, output_bytes=0.0 B, spill_count=0, spilled_bytes=0.0
B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[0 as c0, 1 as c1, 2 as
c2, 3 as c3, 4 as c4, 5 as c5, CAST(c6@0 AS Int64) as c6, 7 as c7, 8 as c8, 9
as c9], metrics=[output_rows=3, elapsed_compute=64.86µs, output_bytes=280.0 B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | | SortExec: expr=[c7@7 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=584.371µs, output_bytes=0.0 B, spill_count=0, spilled_bytes=0.0
B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[0 as c0, 1 as c1, 2 as
c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, CAST(c7@0 AS Int64) as c7, 8 as c8, 9
as c9], metrics=[output_rows=3, elapsed_compute=64.695µs, output_bytes=280.0 B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | | SortExec: expr=[c8@8 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=161.32µs, output_bytes=0.0 B, spill_count=0, spilled_bytes=0.0
B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[0 as c0, 1 as c1, 2 as
c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, 7 as c7, CAST(c8@0 AS Int64) as c8, 9
as c9], metrics=[output_rows=3, elapsed_compute=72.97µs, output_bytes=280.0 B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | | SortExec: expr=[c9@9 ASC NULLS LAST],
preserve_partitioning=[false], metrics=[output_rows=3,
elapsed_compute=48.301µs, output_bytes=0.0 B, spill_count=0, spilled_bytes=0.0
B, spilled_rows=0, batches_split=0]
|
> | | ProjectionExec: expr=[0 as c0, 1 as c1, 2 as
c2, 3 as c3, 4 as c4, 5 as c5, 6 as c6, 7 as c7, 8 as c8, CAST(c9@0 AS Int64)
as c9], metrics=[output_rows=3, elapsed_compute=24.715µs, output_bytes=280.0 B]
|
> | | DataSourceExec: partitions=1,
partition_sizes=[1], metrics=[]
|
> | |
|
>
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
### Proposed Solution:-
Stop `TypeCoercercionRewriter` from trying to coerce on the basis of column
reference's inferred type.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]