bharath-techie opened a new issue, #20267:
URL: https://github.com/apache/datafusion/issues/20267
### Describe the bug
When `enable_dynamic_filter_pushdown` and
`enable_aggregate_dynamic_filter_pushdown` are enabled (the defaults), queries
on parquet files produce incorrect results with target_partitions > 1.
In this case The dynamic filter derived from MAX("ResolutionHeight") is
incorrectly pushed down on ResolutionWidth instead of ResolutionHeight [ just
from what i understand ], causing valid file ranges/row groups to be pruned.
### To Reproduce
### Setup
1. Use the ClickBench hits partitioned dataset
2. use datafusion-cli
3. CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION '/path/to/partitioned/hits/';
4. Query with default settings -- returns wrong results
```
> SELECT COUNT(*) as cnt, AVG("Age") as avg_age, SUM("IsRefresh") as
refreshes, AVG("ResolutionWidth") as avg_width, MAX("ResolutionHeight") as
max_height
FROM hits
WHERE "CounterID" < 1000;
+---------+-------------------+-----------+--------------------+------------+
| cnt | avg_age | refreshes | avg_width | max_height |
+---------+-------------------+-----------+--------------------+------------+
| 1119068 | 26.87201403310612 | 258245 | 1539.4615429982807 | 9575 |
+---------+-------------------+-----------+--------------------+------------+
1 row(s) fetched.
Elapsed 0.038 seconds.
```
5. Set the target partition to 1 --- same query returns right results
```
SELECT COUNT(*) as cnt, AVG("Age") as avg_age, SUM("IsRefresh") as
refreshes, AVG("ResolutionWidth") as avg_width, MAX("ResolutionHeight") as
max_height
FROM hits
WHERE "CounterID" < 1000;
0 row(s) fetched.
Elapsed 0.000 seconds.
0 row(s) fetched.
Elapsed 0.000 seconds.
+---------+-------------------+-----------+--------------------+------------+
| cnt | avg_age | refreshes | avg_width | max_height |
+---------+-------------------+-----------+--------------------+------------+
| 1591782 | 26.14465108915668 | 266931 | 1330.2180763446252 | 9575 |
+---------+-------------------+-----------+--------------------+------------+
1 row(s) fetched.
Elapsed 0.038 seconds.
```
6. Actual workaround with partitions
Disabling dynamic filter pushdown produces correct results even with
parallel execution:
```
SET datafusion.optimizer.enable_dynamic_filter_pushdown = false;
SET datafusion.optimizer.enable_aggregate_dynamic_filter_pushdown = false;
```
### Expected behavior
The query should return right result with dynamic filter pushdown enabled /
default settings with any number of partitions.
### Additional context
check `files_ranges_pruned_statistics`
Wrong results - explain analyze output with dynamic filters enabled :
```
| | DataSourceExec: file_groups={10 groups:
[[home/ec2-user/clickdata/partitioned/hits/hits_0.parquet:0..122446530,
home/ec2-user/clickdata/partitioned/hits/hits_1.parquet:0..174965044,
home/ec2-user/clickdata/partitioned/hits/hits_10.parquet:0..101513258,
home/ec2-user/clickdata/partitioned/hits/hits_11.parquet:0..118419888,
home/ec2-user/clickdata/partitioned/hits/hits_12.parquet:0..149514164, ...],
[home/ec2-user/clickdata/partitioned/hits/hits_19.parquet:55500514..103692598,
home/ec2-user/clickdata/partitioned/hits/hits_2.parquet:0..230595491,
home/ec2-user/clickdata/partitioned/hits/hits_20.parquet:0..85766533,
home/ec2-user/clickdata/partitioned/hits/hits_21.parquet:0..113455196,
home/ec2-user/clickdata/partitioned/hits/hits_22.parquet:0..79775901, ...],
[home/ec2-user/clickdata/partitioned/hits/hits_3.parquet:49140764..192507052,
home/ec2-user/clickdata/partitioned/hits/hits_30.parquet:0..124187913,
home/ec2-user/clickdata/partitioned/hits/hits_3
1.parquet:0..123065410,
home/ec2-user/clickdata/partitioned/hits/hits_32.parquet:0..94506004,
home/ec2-user/clickdata/partitioned/hits/hits_33.parquet:0..78243765, ...],
[home/ec2-user/clickdata/partitioned/hits/hits_40.parquet:60145803..142508647,
home/ec2-user/clickdata/partitioned/hits/hits_41.parquet:0..290614269,
home/ec2-user/clickdata/partitioned/hits/hits_42.parquet:0..288524057,
home/ec2-user/clickdata/partitioned/hits/hits_43.parquet:0..299692947,
home/ec2-user/clickdata/partitioned/hits/hits_44.parquet:0..242404750, ...],
[home/ec2-user/clickdata/partitioned/hits/hits_47.parquet:29698740..34336875,
home/ec2-user/clickdata/partitioned/hits/hits_48.parquet:0..33680419,
home/ec2-user/clickdata/partitioned/hits/hits_49.parquet:0..99795554,
home/ec2-user/clickdata/partitioned/hits/hits_5.parquet:0..122286439,
home/ec2-user/clickdata/partitioned/hits/hits_50.parquet:0..245339079, ...],
...]},
projection=[CounterID, IsRefresh, ResolutionWidth, ResolutionHeight, Age],
file_type=parquet,
predicate=CounterID@6 < 1000 AND DynamicFilter [ ResolutionWidth@20 > 9575
],
pruning_predicate=CounterID_null_count@1 != row_count@2 AND CounterID_min@0
< 1000 AND ResolutionWidth_null_count@4 != row_count@2 AND
ResolutionWidth_max@3 > 9575,
required_guarantees=[], metrics=[output_rows=1.40 M, elapsed_compute=10ns,
output_bytes=16.0 MB, output_batches=171, files_ranges_pruned_statistics=109
total → 1 matched,
row_groups_pruned_statistics=8 total → 6 matched,
row_groups_pruned_bloom_filter=6 total → 6 matched, page_index_rows_pruned=0
total → 0 matched, batches_split=0, bytes_scanned=1.21 M, file_open_errors=0,
file_scan_errors=0, num_predicate_creation_errors=0,
predicate_cache_inner_records=0, predicate_cache_records=0,
predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0,
bloom_filter_eval_time=59.03µs, metadata_load_time=926.80µs,
page_index_eval_time=741ns, row_pushdown_eval_time=218ns,
statistics_eval_time=46.58µs, time_elapsed_opening=2.49ms,
time_elapsed_processing=49.95ms, time_elapsed_scanning_total=49.87ms,
time_elapsed_scanning_until_data=10.93ms, scan_efficiency_ratio=1.2% (1.21
M/103.7 M)]
```
Correct results - explain analyze output - dynamic filters disabled :
```
| | DataSourceExec: file_groups={10 groups:
[[home/ec2-user/clickdata/partitioned/hits/hits_0.parquet:0..122446530,
home/ec2-user/clickdata/partitioned/hits/hits_1.parquet:0..174965044,
home/ec2-user/clickdata/partitioned/hits/hits_10.parquet:0..101513258,
home/ec2-user/clickdata/partitioned/hits/hits_11.parquet:0..118419888,
home/ec2-user/clickdata/partitioned/hits/hits_12.parquet:0..149514164, ...],
[home/ec2-user/clickdata/partitioned/hits/hits_19.parquet:55500514..103692598,
home/ec2-user/clickdata/partitioned/hits/hits_2.parquet:0..230595491,
home/ec2-user/clickdata/partitioned/hits/hits_20.parquet:0..85766533,
home/ec2-user/clickdata/partitioned/hits/hits_21.parquet:0..113455196,
home/ec2-user/clickdata/partitioned/hits/hits_22.parquet:0..79775901, ...],
[home/ec2-user/clickdata/partitioned/hits/hits_3.parquet:49140764..192507052,
home/ec2-user/clickdata/partitioned/hits/hits_30.parquet:0..124187913,
home/ec2-user/clickdata/partitioned/hits/hits_3
1.parquet:0..123065410,
home/ec2-user/clickdata/partitioned/hits/hits_32.parquet:0..94506004,
home/ec2-user/clickdata/partitioned/hits/hits_33.parquet:0..78243765, ...],
[home/ec2-user/clickdata/partitioned/hits/hits_40.parquet:60145803..142508647,
home/ec2-user/clickdata/partitioned/hits/hits_41.parquet:0..290614269,
home/ec2-user/clickdata/partitioned/hits/hits_42.parquet:0..288524057,
home/ec2-user/clickdata/partitioned/hits/hits_43.parquet:0..299692947,
home/ec2-user/clickdata/partitioned/hits/hits_44.parquet:0..242404750, ...],
[home/ec2-user/clickdata/partitioned/hits/hits_47.parquet:29698740..34336875,
home/ec2-user/clickdata/partitioned/hits/hits_48.parquet:0..33680419,
home/ec2-user/clickdata/partitioned/hits/hits_49.parquet:0..99795554,
home/ec2-user/clickdata/partitioned/hits/hits_5.parquet:0..122286439,
home/ec2-user/clickdata/partitioned/hits/hits_50.parquet:0..245339079, ...],
...]}, projection=[CounterID, IsRefresh, ResolutionWidth, ResolutionHeight,
Age], file_type=p
arquet,
predicate=CounterID@6 < 1000, pruning_predicate=CounterID_null_count@1 !=
row_count@2 AND CounterID_min@0 < 1000, required_guarantees=[],
metrics=[output_rows=2.45 M, elapsed_compute=10ns, output_bytes=28.0 MB,
output_batches=301, files_ranges_pruned_statistics=109 total → 20 matched,
row_groups_pruned_statistics=51 total → 6 matched,
row_groups_pruned_bloom_filter=6 total → 6 matched, page_index_rows_pruned=0
total → 0 matched, batches_split=0, bytes_scanned=1.97 M, file_open_errors=0,
file_scan_errors=0, num_predicate_creation_errors=0,
predicate_cache_inner_records=0, predicate_cache_records=0,
predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0,
bloom_filter_eval_time=61.60µs, metadata_load_time=2.64ms,
page_index_eval_time=2.00µs, row_pushdown_eval_time=218ns,
statistics_eval_time=62.77µs, time_elapsed_opening=2.08ms,
time_elapsed_processing=41.38ms, time_elapsed_scanning_total=44.99ms,
time_elapsed_scanning_until_data=10.36ms, scan_efficiency_ratio=1.9% (1.97
M/103.7 M)] |
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]