Mark1626 opened a new issue, #18922:
URL: https://github.com/apache/datafusion/issues/18922
### Describe the bug
I have a hive partitioned TPC-DS dataset and I'm using a custom table
provider where I'm doing some pre-scan partition pruning using
`PartitionPruningStatistics`
A query with a single value in the filter expr works
```
select ss_list_price
from store_sales
where ss_sold_date_sk = 2451529 limit 10;
```
But when there are multiple values in the filter expr it fails
```
select ss_list_price
from store_sales
where ss_sold_date_sk in (2451529, 2452570, 2452596) limit 10;
```
This part of the code seems to be the problem
https://github.com/apache/datafusion/blob/d24eb4a23156b7814836e765d5890186ab40682f/datafusion/common/src/pruning.rs#L240-L250
I think `arrow::compute::kernels::boolean::or` should be used here instead
of the `arrow::compute::kernels::boolean::and`
This query works from the `datafusion-cli` as I suspect that the file
statistics prevents the accidental pruning
```
CREATE EXTERNAL TABLE store_sales
STORED AS PARQUET
LOCATION '/path/to/tpcds_1_delta/store_sales/';
select ss_list_price
from store_sales
where ss_sold_date_sk in (2451529, 2452570, 2452596) limit 10;
```
### To Reproduce
The following unit test will fail in `pruning.rs`
```
#[test]
fn test_partition_pruning_statistics_multiple_values() {
let partition_values = vec![
vec![ScalarValue::from(1i32), ScalarValue::from(2i32)],
vec![ScalarValue::from(3i32), ScalarValue::from(4i32)],
];
let partition_fields = vec![
Arc::new(Field::new("a", DataType::Int32, false)),
Arc::new(Field::new("b", DataType::Int32, false)),
];
let partition_stats =
PartitionPruningStatistics::try_new(partition_values,
partition_fields)
.unwrap();
let column_a = Column::new_unqualified("a");
let column_b = Column::new_unqualified("b");
// Corresponds to
// select * from table where a in (1, 3);
let values = HashSet::from([ScalarValue::from(1i32),
ScalarValue::from(3i32)]);
let contained_a = partition_stats.contained(&column_a, &values).unwrap();
let expected_contained_a = BooleanArray::from(vec![true, true]);
assert_eq!(contained_a, expected_contained_a);
}
```
### Expected behavior
The unit test mentioned above should pass
### Additional context
I can raise a PR for this, let me know if my analysis was correct and if
there is any background context behind why the condition is an `AND` in the
`contained` function
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]