lyne7-sc opened a new issue, #22994:
URL: https://github.com/apache/datafusion/issues/22994
### Describe the bug
Parquet bloom filter pruning can return incorrect results for decimal
columns encoded as `FIXED_LEN_BYTE_ARRAY`.
When a decimal column is encoded as `FIXED_LEN_BYTE_ARRAY`, the bloom filter
is built from the physical Parquet bytes, whose length is the column's fixed
type length. DataFusion currently probes the bloom filter with a fixed-width
integer byte representation of the literal, which may not match the fixed byte
length used in the Parquet file. When the lengths differ, the probe hashes
different bytes than the writer did, so the lookup can miss a value that is
present.
This causes false negatives in bloom filter pruning, and row groups that
contain matching rows are incorrectly skipped.
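
The length mismatch can be illustrated with a small standard-library sketch. It is not DataFusion or parquet code: `min_flba_len` and `to_flba_bytes` are hypothetical helpers, and the sketch assumes that the writer stores the decimal in the minimum number of big-endian two's-complement bytes for its precision and that the probe uses the full 16-byte big-endian form of an `i128`. Under those assumptions, the value `5.00` (unscaled `500`) yields two different byte sequences, and a bloom filter that hashes bytes would treat them as different keys.

```rust
/// Smallest byte length whose signed two's-complement range holds every
/// unscaled value of a decimal with the given precision (1..=38).
/// Hypothetical helper for illustration; not a DataFusion or parquet API.
fn min_flba_len(precision: u32) -> usize {
    assert!((1..=38).contains(&precision), "precision must be in 1..=38");
    let max_unscaled = 10i128.pow(precision) - 1;
    for len in 1..16usize {
        // Largest positive value representable in `len` signed bytes.
        let max_signed = (1i128 << (8 * len - 1)) - 1;
        if max_unscaled <= max_signed {
            return len;
        }
    }
    16
}

/// Big-endian two's-complement bytes of `unscaled`, keeping only the last
/// `len` bytes. Assumes the value fits in `len` bytes, so the dropped
/// leading bytes are pure sign extension. Hypothetical helper.
fn to_flba_bytes(unscaled: i128, len: usize) -> Vec<u8> {
    assert!((1..=16).contains(&len), "len must be in 1..=16");
    unscaled.to_be_bytes()[16 - len..].to_vec()
}

fn main() {
    // 5.00 as DECIMAL(19,2) has the unscaled value 500.
    let unscaled: i128 = 500;

    // Assumption: bytes as the writer would store them for precision 19.
    let written = to_flba_bytes(unscaled, min_flba_len(19));

    // Assumption: bytes used when probing with a fixed 16-byte integer form.
    let probed = unscaled.to_be_bytes();

    println!("written ({} bytes): {:02x?}", written.len(), written);
    println!("probed  ({} bytes): {:02x?}", probed.len(), probed);

    // Same numeric value, different byte sequences, hence different hashes.
    assert_ne!(written.as_slice(), probed.as_slice());
}
```

If these assumptions hold, a fix would need to probe with bytes truncated to the column's declared type length rather than a fixed integer width.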
### To Reproduce
```sql
-- Write six decimal values, two rows per row group, with bloom filters
-- written and statistics disabled.
COPY (
  SELECT CAST(column1 AS DECIMAL(19,2)) AS decimal_col
  FROM (VALUES (1), (2), (3), (4), (5), (6))
) TO '/tmp/df_decimal_bloom_repro'
STORED AS PARQUET
OPTIONS (
  'format.max_row_group_size' '2',
  'format.bloom_filter_on_write' 'true',
  'format.statistics_enabled' 'none'
);

-- Query 1: bloom filter pruning on read is still enabled.
SELECT COUNT(*) AS cnt
FROM '/tmp/df_decimal_bloom_repro'
WHERE decimal_col = CAST(5 AS DECIMAL(19,2));

-- Query 2: the same query with bloom filter pruning on read disabled.
SET datafusion.execution.parquet.bloom_filter_on_read = false;
SELECT COUNT(*) AS cnt
FROM '/tmp/df_decimal_bloom_repro'
WHERE decimal_col = CAST(5 AS DECIMAL(19,2));
```
The first query returns 0, while the second query returns 1.
### Expected behavior
Both queries should return 1. Bloom filter pruning should not remove a row
group that contains the matching decimal value.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]