Re: [PR] fix: Parquet bloom filter pruning can incorrectly filter decimals encoded as FIXED_LEN_BYTE_ARRAY [datafusion]

via GitHub Thu, 18 Jun 2026 02:07:57 -0700


kosiew commented on code in PR #22995:
URL: https://github.com/apache/datafusion/pull/22995#discussion_r3434531839



##########
datafusion/datasource-parquet/src/bloom_filter.rs:
##########
@@ -39,7 +39,8 @@ pub(crate) struct BloomFilterStatistics {
     /// Value:
     /// * [`Sbbf`] (Bloom filter),
     /// * Parquet physical [`Type`] needed to evaluate  literals against the filter
-    column_sbbf: HashMap<String, (Sbbf, Type)>,
+    /// * Type length from the Parquet column descriptor
+    column_sbbf: HashMap<String, (Sbbf, Type, i32)>,

Review Comment:
   This tuple now carries a pretty important `type_length` contract. A small 
named struct, such as `struct ColumnBloomFilter { sbbf: Sbbf, physical_type: 
Type, type_length: i32 }`, could make the invariant clearer and help avoid 
accidentally mixing up tuple fields at call sites.
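
   A minimal sketch of what that named struct might look like. `Sbbf` and `Type` here are local stand-ins for the `parquet` crate types so the snippet compiles on its own, and the `insert`, `type_length`, and `physical_type` helpers are hypothetical, not part of the PR:

```rust
use std::collections::HashMap;

// Stand-in for the parquet crate's `Sbbf` (assumption: only used as an opaque value here).
#[allow(dead_code)]
#[derive(Debug, Default)]
struct Sbbf;

// Stand-in for the parquet crate's physical `Type` enum, reduced to a few variants.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Type {
    Int32,
    Int64,
    FixedLenByteArray,
}

// Named fields replace the positional `(Sbbf, Type, i32)` tuple, so call sites
// cannot silently swap or misread the third element.
#[allow(dead_code)]
#[derive(Debug)]
struct ColumnBloomFilter {
    sbbf: Sbbf,
    physical_type: Type,
    // Type length from the Parquet column descriptor.
    type_length: i32,
}

#[derive(Debug, Default)]
struct BloomFilterStatistics {
    column_sbbf: HashMap<String, ColumnBloomFilter>,
}

#[allow(dead_code)]
impl BloomFilterStatistics {
    // Hypothetical helper: register the filter for a column.
    fn insert(&mut self, column: &str, filter: ColumnBloomFilter) {
        self.column_sbbf.insert(column.to_string(), filter);
    }

    // Hypothetical helper: look up the declared type length by name, not by tuple index.
    fn type_length(&self, column: &str) -> Option<i32> {
        self.column_sbbf.get(column).map(|f| f.type_length)
    }

    // Hypothetical helper: look up the physical type for a column.
    fn physical_type(&self, column: &str) -> Option<Type> {
        self.column_sbbf.get(column).map(|f| f.physical_type)
    }
}
```

   With named fields, a call site reads `filter.type_length` rather than `filter.2`, which is the readability gain the comment is after.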



##########
datafusion/datasource-parquet/src/bloom_filter.rs:
##########
@@ -375,6 +409,40 @@ mod tests {
             .await
     }
 
+    #[tokio::test]
+    async fn test_row_group_bloom_filter_pruning_predicate_decimal128() {

Review Comment:
   Nice regression coverage for the fixed-width truncation path. It might be 
worth adding a negative decimal case as well, since Parquet fixed-len decimal 
bytes depend on two's-complement sign extension and truncation. For example, 
you could write row groups with negative values and assert that a predicate 
like `decimal_col = -500` keeps only the matching row group.
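
   To illustrate why the negative case is worth covering, here is a rough sketch of narrowing an `i128` unscaled decimal to a fixed-length big-endian two's-complement byte string. The function name `decimal_to_fixed_len_bytes` is hypothetical and this is not the PR's implementation; it only shows that the dropped leading bytes are `0xFF` for negative values and `0x00` for non-negative ones:

```rust
// Narrow an i128 unscaled decimal value to `type_length` big-endian bytes.
// Returns None when the value does not fit, i.e. when truncation would drop
// bytes that are not pure sign extension or would flip the sign bit.
fn decimal_to_fixed_len_bytes(value: i128, type_length: usize) -> Option<Vec<u8>> {
    if type_length == 0 || type_length > 16 {
        return None;
    }
    let be = value.to_be_bytes(); // 16 bytes, most significant first
    let (dropped, kept) = be.split_at(16 - type_length);

    // Sign extension fills the leading bytes with 0xFF for negatives, 0x00 otherwise.
    let sign_byte: u8 = if value < 0 { 0xFF } else { 0x00 };
    let kept_is_negative = kept[0] & 0x80 != 0;

    if dropped.iter().any(|&b| b != sign_byte) || kept_is_negative != (value < 0) {
        return None;
    }
    Some(kept.to_vec())
}
```

   Assuming an unscaled value of -500 and a type length of 4, this yields `FF FF FE 0C`, while 500 yields `00 00 01 F4`. A lookup that hashed the wrong width or skipped sign extension would miss the negative value even though the positive one matches.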



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


