xiedeyantu commented on PR #21075:
URL: https://github.com/apache/datafusion/pull/21075#issuecomment-4149478979
I tested the execution times, and in reality, the difference between the two
was not significant. My understanding is that the multiple branches within a
`UNION` operation can be processed in parallel; therefore, one would not expect
to see a substantial reduction in overall execution time (and if a significant
improvement *were* observed, it would likely indicate an issue elsewhere).
Consequently, I had to rely on `EXPLAIN ANALYZE` to inspect the execution plan,
thereby demonstrating that—prior to optimization—the data required two separate
scans, whereas after optimization, only a single scan was necessary.
The test SQL script is as follows:
```
# test_data.csv
id,category,amount,created_at
1,A,10,2026-03-29 00:00:01
2,B,20,2026-03-29 00:00:02
3,C,30,2026-03-29 00:00:03
......
9999,D,99990,2026-03-29 02:46:39
10000,E,100000,2026-03-29 02:46:40
# create table
CREATE EXTERNAL TABLE t (
id INT,
category STRING,
amount INT,
created_at TIMESTAMP
)
STORED AS CSV
LOCATION 'test_data.csv'
OPTIONS (
has_header 'true'
);
set datafusion.optimizer.enable_unions_to_filter=false;
EXPLAIN ANALYZE
SELECT category FROM t WHERE id > 5
UNION
SELECT category FROM t WHERE id < 10;
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | AggregateExec: mode=FinalPartitioned, gby=[category@0
as category], aggr=[], metrics=[output_rows=5, elapsed_compute=512.96µs,
output_bytes=128.0 B, output_batches=2, spill_count=0, spilled_bytes=0.0 B,
spilled_rows=0, peak_mem_used=3.49 K, aggregate_arguments_time=8ns,
aggregation_time=8ns, emitting_time=5.51µs, time_calculating_group_ids=21.34µs]
|
| | RepartitionExec: partitioning=Hash([category@0], 8),
input_partitions=16, metrics=[output_rows=15, elapsed_compute=129.16µs,
output_bytes=256.0 KB, output_batches=2, spill_count=0, spilled_bytes=0.0 B,
spilled_rows=0, fetch_time=311.11ms, repartition_time=52.80µs,
send_time=27.66µs]
|
| | AggregateExec: mode=Partial, gby=[category@0 as
category], aggr=[], metrics=[output_rows=15, elapsed_compute=2.50ms,
output_bytes=384.0 B, output_batches=3, spill_count=0, spilled_bytes=0.0 B,
spilled_rows=0, skipped_aggregation_rows=0, peak_mem_used=167.2 K,
aggregate_arguments_time=16ns, aggregation_time=16ns, emitting_time=16.47µs,
time_calculating_group_ids=2.06ms, reduction_factor=0.15% (15/10.00 K)]
|
| | UnionExec, metrics=[output_rows=10.00 K,
elapsed_compute=298.29µs, output_bytes=383.9 KB, output_batches=3]
|
| | FilterExec: id@0 > 5, projection=[category@1],
metrics=[output_rows=9.99 K, elapsed_compute=418.21µs, output_bytes=255.9 KB,
output_batches=2, selectivity=100% (9.99 K/10.00 K)]
|
| | RepartitionExec:
partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[output_rows=10.00
K, elapsed_compute=35.17µs, output_bytes=320.0 KB, output_batches=2,
spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=19.31ms,
repartition_time=1ns, send_time=16.59µs]
|
| | DataSourceExec: file_groups={1 group:
[[Users/jensen/test/test_data.csv]]}, projection=[id, category], file_type=csv,
has_header=true, metrics=[output_rows=10.00 K, elapsed_compute=18.83ms,
output_bytes=196.2 KB, output_batches=2, batches_split=0, file_open_errors=0,
file_scan_errors=0, files_opened=1, files_processed=1,
time_elapsed_opening=359.75µs, time_elapsed_processing=19.09ms,
time_elapsed_scanning_total=18.89ms, time_elapsed_scanning_until_data=15.39ms] |
| | FilterExec: id@0 < 10,
projection=[category@1], metrics=[output_rows=9, elapsed_compute=310.84µs,
output_bytes=128.0 KB, output_batches=1, selectivity=0.09% (9/10.00 K)]
|
| | RepartitionExec:
partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[output_rows=10.00
K, elapsed_compute=62.12µs, output_bytes=320.0 KB, output_batches=2,
spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=18.82ms,
repartition_time=1ns, send_time=47.76µs]
|
| | DataSourceExec: file_groups={1 group:
[[Users/jensen/test/test_data.csv]]}, projection=[id, category], file_type=csv,
has_header=true, metrics=[output_rows=10.00 K, elapsed_compute=18.35ms,
output_bytes=196.2 KB, output_batches=2, batches_split=0, file_open_errors=0,
file_scan_errors=0, files_opened=1, files_processed=1,
time_elapsed_opening=364.33µs, time_elapsed_processing=18.63ms,
time_elapsed_scanning_total=18.48ms, time_elapsed_scanning_until_data=15.05ms] |
| |
|
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.038 seconds.
set datafusion.optimizer.enable_unions_to_filter=true;
EXPLAIN ANALYZE
SELECT category FROM t WHERE id > 5
UNION
SELECT category FROM t WHERE id < 10;
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan
|
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | AggregateExec: mode=FinalPartitioned, gby=[category@0
as category], aggr=[], metrics=[output_rows=5, elapsed_compute=374.54µs,
output_bytes=128.0 B, output_batches=2, spill_count=0, spilled_bytes=0.0 B,
spilled_rows=0, peak_mem_used=3.36 K, aggregate_arguments_time=8ns,
aggregation_time=8ns, emitting_time=5.50µs, time_calculating_group_ids=19.13µs]
|
| | RepartitionExec: partitioning=Hash([category@0], 8),
input_partitions=8, metrics=[output_rows=10, elapsed_compute=97.63µs,
output_bytes=256.0 KB, output_batches=2, spill_count=0, spilled_bytes=0.0 B,
spilled_rows=0, fetch_time=180.39ms, repartition_time=48.84µs,
send_time=30.44µs]
|
| | AggregateExec: mode=Partial, gby=[category@0 as
category], aggr=[], metrics=[output_rows=10, elapsed_compute=2.00ms,
output_bytes=256.0 B, output_batches=2, spill_count=0, spilled_bytes=0.0 B,
spilled_rows=0, skipped_aggregation_rows=0, peak_mem_used=164.5 K,
aggregate_arguments_time=8ns, aggregation_time=8ns, emitting_time=14.63µs,
time_calculating_group_ids=1.81ms, reduction_factor=0.1% (10/10.00 K)]
|
| | FilterExec: id@0 > 5 OR id@0 < 10,
projection=[category@1], metrics=[output_rows=10.00 K,
elapsed_compute=526.17µs, output_bytes=256.0 KB, output_batches=2,
selectivity=100% (10.00 K/10.00 K)]
|
| | RepartitionExec:
partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[output_rows=10.00
K, elapsed_compute=58.50µs, output_bytes=320.0 KB, output_batches=2,
spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=22.05ms,
repartition_time=1ns, send_time=25.67µs]
|
| | DataSourceExec: file_groups={1 group:
[[Users/jensen/test/test_data.csv]]}, projection=[id, category], file_type=csv,
has_header=true, metrics=[output_rows=10.00 K, elapsed_compute=21.49ms,
output_bytes=196.2 KB, output_batches=2, batches_split=0, file_open_errors=0,
file_scan_errors=0, files_opened=1, files_processed=1,
time_elapsed_opening=468.54µs, time_elapsed_processing=21.82ms,
time_elapsed_scanning_total=21.57ms, time_elapsed_scanning_until_data=18.30ms] |
| |
|
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.040 seconds.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]