viirya commented on code in PR #9628:
URL: https://github.com/apache/arrow-rs/pull/9628#discussion_r3018511311
##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -3142,6 +3148,41 @@ mod tests {
);
}
+ /// Test that bloom filter folding produces correct results even when
+ /// the configured NDV differs significantly from actual NDV.
+ /// A large NDV means a larger initial filter that gets folded down;
+ /// a small NDV means a smaller initial filter.
+ #[test]
+ fn i32_column_bloom_filter_fixed_ndv() {
+ let array = Arc::new(Int32Array::from_iter(0..SMALL_SIZE as i32));
+
+ // NDV much larger than actual distinct values — tests folding a large filter down
+ let mut options = RoundTripOptions::new(array.clone(), false);
+ options.bloom_filter = true;
+ options.bloom_filter_ndv = Some(1_000_000);
+
+ let files = one_column_roundtrip_with_options(options);
+ check_bloom_filter(
+ files,
+ "col".to_string(),
+ (0..SMALL_SIZE as i32).collect(),
+ (SMALL_SIZE as i32 + 1..SMALL_SIZE as i32 + 10).collect(),
+ );
+
+ // NDV smaller than actual distinct values — tests the underestimate path
Review Comment:
The array has only 7 distinct values, so "NDV smaller than actual distinct
values" seems incorrect?
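
A minimal standalone sketch of the check behind this comment (not code from the PR): it counts the distinct values in the test array and compares that count with a configured NDV hint. `classify_ndv_hint` and `NdvHint` are hypothetical names used only for illustration, and `SMALL_SIZE = 7` is assumed from the distinct-value count mentioned above.

```rust
use std::cmp::Ordering;
use std::collections::HashSet;

/// Assumed value, inferred from the "7 distinct values" in the review comment.
const SMALL_SIZE: usize = 7;

/// How a configured NDV hint relates to the data actually written.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum NdvHint {
    Underestimate,
    Exact,
    Overestimate,
}

/// Count the distinct values and classify the configured hint against that count.
fn classify_ndv_hint(values: &[i32], configured_ndv: u64) -> NdvHint {
    let actual = values.iter().collect::<HashSet<_>>().len() as u64;
    match configured_ndv.cmp(&actual) {
        Ordering::Less => NdvHint::Underestimate,
        Ordering::Equal => NdvHint::Exact,
        Ordering::Greater => NdvHint::Overestimate,
    }
}

/// Same values as the test array: 0..SMALL_SIZE.
fn test_values() -> Vec<i32> {
    (0..SMALL_SIZE as i32).collect()
}
```

With these assumptions, a hint only exercises the underestimate path when it is below 7; any hint of 7 or more is exact or an overestimate for this array.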