bharath-techie opened a new issue, #20267:
URL: https://github.com/apache/datafusion/issues/20267

   ### Describe the bug
   
   When `enable_dynamic_filter_pushdown` and 
`enable_aggregate_dynamic_filter_pushdown` are enabled (the defaults), queries 
on parquet files produce incorrect results with target_partitions > 1.
   
   In this case The dynamic filter derived from MAX("ResolutionHeight") is 
incorrectly pushed down on ResolutionWidth instead of ResolutionHeight [ just 
from what i understand ], causing valid file ranges/row groups to be pruned.
   
   
   ### To Reproduce
   
   ### Setup
   
   1. Use the ClickBench hits partitioned dataset
   
   2. use datafusion-cli
   
   3. CREATE EXTERNAL TABLE hits
   STORED AS PARQUET
   LOCATION '/path/to/partitioned/hits/';
   
   4. Query with default settings -- returns wrong results
   ```
   >  SELECT COUNT(*) as cnt, AVG("Age") as avg_age, SUM("IsRefresh") as 
refreshes, AVG("ResolutionWidth") as avg_width, MAX("ResolutionHeight") as 
max_height 
   FROM hits 
   WHERE "CounterID" < 1000;
   +---------+-------------------+-----------+--------------------+------------+
   | cnt     | avg_age           | refreshes | avg_width          | max_height |
   +---------+-------------------+-----------+--------------------+------------+
   | 1119068 | 26.87201403310612 | 258245    | 1539.4615429982807 | 9575       |
   +---------+-------------------+-----------+--------------------+------------+
   1 row(s) fetched. 
   Elapsed 0.038 seconds.
   ```
   5. Set the target partition to 1 --- same query returns right results
   ```
   SELECT COUNT(*) as cnt, AVG("Age") as avg_age, SUM("IsRefresh") as 
refreshes, AVG("ResolutionWidth") as avg_width, MAX("ResolutionHeight") as 
max_height 
   FROM hits 
   WHERE "CounterID" < 1000;
   0 row(s) fetched. 
   Elapsed 0.000 seconds.
   
   0 row(s) fetched. 
   Elapsed 0.000 seconds.
   
   +---------+-------------------+-----------+--------------------+------------+
   | cnt     | avg_age           | refreshes | avg_width          | max_height |
   +---------+-------------------+-----------+--------------------+------------+
   | 1591782 | 26.14465108915668 | 266931    | 1330.2180763446252 | 9575       |
   +---------+-------------------+-----------+--------------------+------------+
   1 row(s) fetched. 
   Elapsed 0.038 seconds.
   ```
   
   6. Actual workaround with partitions
   
   Disabling dynamic filter pushdown produces correct results even with 
parallel execution:
   ```
   SET datafusion.optimizer.enable_dynamic_filter_pushdown = false;
   SET datafusion.optimizer.enable_aggregate_dynamic_filter_pushdown = false;
   ```
   
   
   ### Expected behavior
   
   The query should return right result with dynamic filter pushdown enabled / 
default settings with any number of partitions.
   
   ### Additional context
   
   check `files_ranges_pruned_statistics` 
   Wrong results - explain analyze output with dynamic filters enabled : 
   ```
   |                   |           DataSourceExec: file_groups={10 groups: 
[[home/ec2-user/clickdata/partitioned/hits/hits_0.parquet:0..122446530, 
home/ec2-user/clickdata/partitioned/hits/hits_1.parquet:0..174965044, 
home/ec2-user/clickdata/partitioned/hits/hits_10.parquet:0..101513258, 
home/ec2-user/clickdata/partitioned/hits/hits_11.parquet:0..118419888, 
home/ec2-user/clickdata/partitioned/hits/hits_12.parquet:0..149514164, ...], 
[home/ec2-user/clickdata/partitioned/hits/hits_19.parquet:55500514..103692598, 
home/ec2-user/clickdata/partitioned/hits/hits_2.parquet:0..230595491, 
home/ec2-user/clickdata/partitioned/hits/hits_20.parquet:0..85766533, 
home/ec2-user/clickdata/partitioned/hits/hits_21.parquet:0..113455196, 
home/ec2-user/clickdata/partitioned/hits/hits_22.parquet:0..79775901, ...], 
[home/ec2-user/clickdata/partitioned/hits/hits_3.parquet:49140764..192507052, 
home/ec2-user/clickdata/partitioned/hits/hits_30.parquet:0..124187913, 
home/ec2-user/clickdata/partitioned/hits/hits_3
 1.parquet:0..123065410, 
home/ec2-user/clickdata/partitioned/hits/hits_32.parquet:0..94506004, 
home/ec2-user/clickdata/partitioned/hits/hits_33.parquet:0..78243765, ...], 
[home/ec2-user/clickdata/partitioned/hits/hits_40.parquet:60145803..142508647, 
home/ec2-user/clickdata/partitioned/hits/hits_41.parquet:0..290614269, 
home/ec2-user/clickdata/partitioned/hits/hits_42.parquet:0..288524057, 
home/ec2-user/clickdata/partitioned/hits/hits_43.parquet:0..299692947, 
home/ec2-user/clickdata/partitioned/hits/hits_44.parquet:0..242404750, ...], 
[home/ec2-user/clickdata/partitioned/hits/hits_47.parquet:29698740..34336875, 
home/ec2-user/clickdata/partitioned/hits/hits_48.parquet:0..33680419, 
home/ec2-user/clickdata/partitioned/hits/hits_49.parquet:0..99795554, 
home/ec2-user/clickdata/partitioned/hits/hits_5.parquet:0..122286439, 
home/ec2-user/clickdata/partitioned/hits/hits_50.parquet:0..245339079, ...], 
...]}, 
   
   projection=[CounterID, IsRefresh, ResolutionWidth, ResolutionHeight, Age], 
file_type=parquet, 
   predicate=CounterID@6 < 1000 AND DynamicFilter [ ResolutionWidth@20 > 9575 
], 
   pruning_predicate=CounterID_null_count@1 != row_count@2 AND CounterID_min@0 
< 1000 AND ResolutionWidth_null_count@4 != row_count@2 AND 
ResolutionWidth_max@3 > 9575, 
   
   
   required_guarantees=[], metrics=[output_rows=1.40 M, elapsed_compute=10ns, 
output_bytes=16.0 MB, output_batches=171, files_ranges_pruned_statistics=109 
total → 1 matched, 
   
   row_groups_pruned_statistics=8 total → 6 matched, 
row_groups_pruned_bloom_filter=6 total → 6 matched, page_index_rows_pruned=0 
total → 0 matched, batches_split=0, bytes_scanned=1.21 M, file_open_errors=0, 
file_scan_errors=0, num_predicate_creation_errors=0, 
predicate_cache_inner_records=0, predicate_cache_records=0, 
predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, 
bloom_filter_eval_time=59.03µs, metadata_load_time=926.80µs, 
page_index_eval_time=741ns, row_pushdown_eval_time=218ns, 
statistics_eval_time=46.58µs, time_elapsed_opening=2.49ms, 
time_elapsed_processing=49.95ms, time_elapsed_scanning_total=49.87ms, 
time_elapsed_scanning_until_data=10.93ms, scan_efficiency_ratio=1.2% (1.21 
M/103.7 M)]
   ```
   
   Correct results - explain analyze output - dynamic filters disabled :
   ```
   |                   |           DataSourceExec: file_groups={10 groups: 
[[home/ec2-user/clickdata/partitioned/hits/hits_0.parquet:0..122446530, 
home/ec2-user/clickdata/partitioned/hits/hits_1.parquet:0..174965044, 
home/ec2-user/clickdata/partitioned/hits/hits_10.parquet:0..101513258, 
home/ec2-user/clickdata/partitioned/hits/hits_11.parquet:0..118419888, 
home/ec2-user/clickdata/partitioned/hits/hits_12.parquet:0..149514164, ...], 
[home/ec2-user/clickdata/partitioned/hits/hits_19.parquet:55500514..103692598, 
home/ec2-user/clickdata/partitioned/hits/hits_2.parquet:0..230595491, 
home/ec2-user/clickdata/partitioned/hits/hits_20.parquet:0..85766533, 
home/ec2-user/clickdata/partitioned/hits/hits_21.parquet:0..113455196, 
home/ec2-user/clickdata/partitioned/hits/hits_22.parquet:0..79775901, ...], 
[home/ec2-user/clickdata/partitioned/hits/hits_3.parquet:49140764..192507052, 
home/ec2-user/clickdata/partitioned/hits/hits_30.parquet:0..124187913, 
home/ec2-user/clickdata/partitioned/hits/hits_3
 1.parquet:0..123065410, 
home/ec2-user/clickdata/partitioned/hits/hits_32.parquet:0..94506004, 
home/ec2-user/clickdata/partitioned/hits/hits_33.parquet:0..78243765, ...], 
[home/ec2-user/clickdata/partitioned/hits/hits_40.parquet:60145803..142508647, 
home/ec2-user/clickdata/partitioned/hits/hits_41.parquet:0..290614269, 
home/ec2-user/clickdata/partitioned/hits/hits_42.parquet:0..288524057, 
home/ec2-user/clickdata/partitioned/hits/hits_43.parquet:0..299692947, 
home/ec2-user/clickdata/partitioned/hits/hits_44.parquet:0..242404750, ...], 
[home/ec2-user/clickdata/partitioned/hits/hits_47.parquet:29698740..34336875, 
home/ec2-user/clickdata/partitioned/hits/hits_48.parquet:0..33680419, 
home/ec2-user/clickdata/partitioned/hits/hits_49.parquet:0..99795554, 
home/ec2-user/clickdata/partitioned/hits/hits_5.parquet:0..122286439, 
home/ec2-user/clickdata/partitioned/hits/hits_50.parquet:0..245339079, ...], 
...]}, projection=[CounterID, IsRefresh, ResolutionWidth, ResolutionHeight, 
Age], file_type=p
 arquet, 
   
   predicate=CounterID@6 < 1000, pruning_predicate=CounterID_null_count@1 != 
row_count@2 AND CounterID_min@0 < 1000, required_guarantees=[],
    metrics=[output_rows=2.45 M, elapsed_compute=10ns, output_bytes=28.0 MB, 
output_batches=301, files_ranges_pruned_statistics=109 total → 20 matched, 
   
   row_groups_pruned_statistics=51 total → 6 matched, 
row_groups_pruned_bloom_filter=6 total → 6 matched, page_index_rows_pruned=0 
total → 0 matched, batches_split=0, bytes_scanned=1.97 M, file_open_errors=0, 
file_scan_errors=0, num_predicate_creation_errors=0, 
predicate_cache_inner_records=0, predicate_cache_records=0, 
predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, 
bloom_filter_eval_time=61.60µs, metadata_load_time=2.64ms, 
page_index_eval_time=2.00µs, row_pushdown_eval_time=218ns, 
statistics_eval_time=62.77µs, time_elapsed_opening=2.08ms, 
time_elapsed_processing=41.38ms, time_elapsed_scanning_total=44.99ms, 
time_elapsed_scanning_until_data=10.36ms, scan_efficiency_ratio=1.9% (1.97 
M/103.7 M)] |
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to