Re: [PR] Add configurable UNION DISTINCT to FILTER rewrite optimization [datafusion]

via GitHub Sat, 28 Mar 2026 22:29:56 -0700


xiedeyantu commented on PR #21075:
URL: https://github.com/apache/datafusion/pull/21075#issuecomment-4149478979


   I tested the execution times, and in reality, the difference between the two 
was not significant. My understanding is that the multiple branches within a 
`UNION` operation can be processed in parallel; therefore, one would not expect 
to see a substantial reduction in overall execution time (and if a significant 
improvement *were* observed, it would likely indicate an issue elsewhere). 
Consequently, I had to rely on `EXPLAIN ANALYZE` to inspect the execution plan, 
thereby demonstrating that—prior to optimization—the data required two separate 
scans, whereas after optimization, only a single scan was necessary.
   
   The test SQL script is as follows:
   ```
   # test_data.csv
   id,category,amount,created_at
   1,A,10,2026-03-29 00:00:01
   2,B,20,2026-03-29 00:00:02
   3,C,30,2026-03-29 00:00:03
   ......
   9999,D,99990,2026-03-29 02:46:39
   10000,E,100000,2026-03-29 02:46:40
   
   # create table
   CREATE EXTERNAL TABLE t (
     id INT,
     category STRING,
     amount INT,
     created_at TIMESTAMP
   )
   STORED AS CSV
   LOCATION 'test_data.csv'
   OPTIONS (
     has_header 'true'
   );
   
   set datafusion.optimizer.enable_unions_to_filter=false;
   EXPLAIN ANALYZE 
   SELECT category FROM t WHERE id > 5
   UNION
   SELECT category FROM t WHERE id < 10;
   
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                       |
   
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | AggregateExec: mode=FinalPartitioned, gby=[category@0 
as category], aggr=[], metrics=[output_rows=5, elapsed_compute=512.96µs, 
output_bytes=128.0 B, output_batches=2, spill_count=0, spilled_bytes=0.0 B, 
spilled_rows=0, peak_mem_used=3.49 K, aggregate_arguments_time=8ns, 
aggregation_time=8ns, emitting_time=5.51µs, time_calculating_group_ids=21.34µs] 
                                                                                
                                               |
   |                   |   RepartitionExec: partitioning=Hash([category@0], 8), 
input_partitions=16, metrics=[output_rows=15, elapsed_compute=129.16µs, 
output_bytes=256.0 KB, output_batches=2, spill_count=0, spilled_bytes=0.0 B, 
spilled_rows=0, fetch_time=311.11ms, repartition_time=52.80µs, 
send_time=27.66µs]                                                              
                                                                                
                                                   |
   |                   |     AggregateExec: mode=Partial, gby=[category@0 as 
category], aggr=[], metrics=[output_rows=15, elapsed_compute=2.50ms, 
output_bytes=384.0 B, output_batches=3, spill_count=0, spilled_bytes=0.0 B, 
spilled_rows=0, skipped_aggregation_rows=0, peak_mem_used=167.2 K, 
aggregate_arguments_time=16ns, aggregation_time=16ns, emitting_time=16.47µs, 
time_calculating_group_ids=2.06ms, reduction_factor=0.15% (15/10.00 K)]         
                                                         |
   |                   |       UnionExec, metrics=[output_rows=10.00 K, 
elapsed_compute=298.29µs, output_bytes=383.9 KB, output_batches=3]              
                                                                                
                                                                                
                                                                                
                                                                                
                               |
   |                   |         FilterExec: id@0 > 5, projection=[category@1], 
metrics=[output_rows=9.99 K, elapsed_compute=418.21µs, output_bytes=255.9 KB, 
output_batches=2, selectivity=100% (9.99 K/10.00 K)]                            
                                                                                
                                                                                
                                                                                
                         |
   |                   |           RepartitionExec: 
partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[output_rows=10.00 
K, elapsed_compute=35.17µs, output_bytes=320.0 KB, output_batches=2, 
spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=19.31ms, 
repartition_time=1ns, send_time=16.59µs]                                        
                                                                                
                                                                      |
   |                   |             DataSourceExec: file_groups={1 group: 
[[Users/jensen/test/test_data.csv]]}, projection=[id, category], file_type=csv, 
has_header=true, metrics=[output_rows=10.00 K, elapsed_compute=18.83ms, 
output_bytes=196.2 KB, output_batches=2, batches_split=0, file_open_errors=0, 
file_scan_errors=0, files_opened=1, files_processed=1, 
time_elapsed_opening=359.75µs, time_elapsed_processing=19.09ms, 
time_elapsed_scanning_total=18.89ms, time_elapsed_scanning_until_data=15.39ms] |
   |                   |         FilterExec: id@0 < 10, 
projection=[category@1], metrics=[output_rows=9, elapsed_compute=310.84µs, 
output_bytes=128.0 KB, output_batches=1, selectivity=0.09% (9/10.00 K)]         
                                                                                
                                                                                
                                                                                
                                                    |
   |                   |           RepartitionExec: 
partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[output_rows=10.00 
K, elapsed_compute=62.12µs, output_bytes=320.0 KB, output_batches=2, 
spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=18.82ms, 
repartition_time=1ns, send_time=47.76µs]                                        
                                                                                
                                                                      |
   |                   |             DataSourceExec: file_groups={1 group: 
[[Users/jensen/test/test_data.csv]]}, projection=[id, category], file_type=csv, 
has_header=true, metrics=[output_rows=10.00 K, elapsed_compute=18.35ms, 
output_bytes=196.2 KB, output_batches=2, batches_split=0, file_open_errors=0, 
file_scan_errors=0, files_opened=1, files_processed=1, 
time_elapsed_opening=364.33µs, time_elapsed_processing=18.63ms, 
time_elapsed_scanning_total=18.48ms, time_elapsed_scanning_until_data=15.05ms] |
   |                   |                                                        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                       |
   
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   1 row(s) fetched. 
   Elapsed 0.038 seconds.
   
   set datafusion.optimizer.enable_unions_to_filter=true;
   EXPLAIN ANALYZE 
   SELECT category FROM t WHERE id > 5
   UNION
   SELECT category FROM t WHERE id < 10;
   
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type         | plan                                                   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                     |
   
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | Plan with Metrics | AggregateExec: mode=FinalPartitioned, gby=[category@0 
as category], aggr=[], metrics=[output_rows=5, elapsed_compute=374.54µs, 
output_bytes=128.0 B, output_batches=2, spill_count=0, spilled_bytes=0.0 B, 
spilled_rows=0, peak_mem_used=3.36 K, aggregate_arguments_time=8ns, 
aggregation_time=8ns, emitting_time=5.50µs, time_calculating_group_ids=19.13µs] 
                                                                                
                                             |
   |                   |   RepartitionExec: partitioning=Hash([category@0], 8), 
input_partitions=8, metrics=[output_rows=10, elapsed_compute=97.63µs, 
output_bytes=256.0 KB, output_batches=2, spill_count=0, spilled_bytes=0.0 B, 
spilled_rows=0, fetch_time=180.39ms, repartition_time=48.84µs, 
send_time=30.44µs]                                                              
                                                                                
                                                   |
   |                   |     AggregateExec: mode=Partial, gby=[category@0 as 
category], aggr=[], metrics=[output_rows=10, elapsed_compute=2.00ms, 
output_bytes=256.0 B, output_batches=2, spill_count=0, spilled_bytes=0.0 B, 
spilled_rows=0, skipped_aggregation_rows=0, peak_mem_used=164.5 K, 
aggregate_arguments_time=8ns, aggregation_time=8ns, emitting_time=14.63µs, 
time_calculating_group_ids=1.81ms, reduction_factor=0.1% (10/10.00 K)]          
                                                         |
   |                   |       FilterExec: id@0 > 5 OR id@0 < 10, 
projection=[category@1], metrics=[output_rows=10.00 K, 
elapsed_compute=526.17µs, output_bytes=256.0 KB, output_batches=2, 
selectivity=100% (10.00 K/10.00 K)]                                             
                                                                                
                                                                                
                                                                         |
   |                   |         RepartitionExec: 
partitioning=RoundRobinBatch(8), input_partitions=1, metrics=[output_rows=10.00 
K, elapsed_compute=58.50µs, output_bytes=320.0 KB, output_batches=2, 
spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, fetch_time=22.05ms, 
repartition_time=1ns, send_time=25.67µs]                                        
                                                                                
                                                                      |
   |                   |           DataSourceExec: file_groups={1 group: 
[[Users/jensen/test/test_data.csv]]}, projection=[id, category], file_type=csv, 
has_header=true, metrics=[output_rows=10.00 K, elapsed_compute=21.49ms, 
output_bytes=196.2 KB, output_batches=2, batches_split=0, file_open_errors=0, 
file_scan_errors=0, files_opened=1, files_processed=1, 
time_elapsed_opening=468.54µs, time_elapsed_processing=21.82ms, 
time_elapsed_scanning_total=21.57ms, time_elapsed_scanning_until_data=18.30ms] |
   |                   |                                                        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                     |
   
+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   1 row(s) fetched. 
   Elapsed 0.040 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add configurable UNION DISTINCT to FILTER rewrite optimization [datafusion]

Reply via email to