mr-brobot commented on issue #5550: URL: https://github.com/apache/arrow-rs/issues/5550#issuecomment-2028482716
## Overview

The general issue is that Parquet types are a subset of Arrow types, so the Arrow writer must coerce Arrow types to Parquet types. In some cases, this changes the physical representation. As a result, bloom filter checks can produce false negatives when the original Arrow representation is passed; correctness is only guaranteed when checking with the coerced Parquet value.

For the issue mentioned above, [Parquet only supports `Int32` and `Int64` integer types](https://parquet.apache.org/docs/file-format/types/):

> 16-bit ints are not explicitly supported in the storage format since they are covered by 32-bit ints with an efficient encoding

`Int8` and `Int16` Arrow types are widened to `Int32` when writing. For correct bloom filter behavior, they must be widened prior to calling `Sbbf::check`. Widening before checking makes the tests mentioned above pass.

However, this issue is not limited to `Int8` and `Int16`. For example, when writing `Date64`, values are coerced to `Date32` and then to `Int32`. This changes the physical representation from `Int64` "milliseconds since epoch" to `Int32` "days since epoch".

```rust
#[test]
fn date64_column_bloom_filter() {
    use chrono::NaiveDate;

    // construct input data
    let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
    let dates = vec![
        NaiveDate::from_ymd_opt(1915, 11, 25).unwrap(),
        NaiveDate::from_ymd_opt(1964, 4, 14).unwrap(),
        NaiveDate::from_ymd_opt(1992, 9, 2).unwrap(),
        NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
    ];
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_milliseconds())
        .collect();

    // write (asserts written values match input)
    let array = Arc::new(Date64Array::from(values.clone()));
    let mut options = RoundTripOptions::new(array, false);
    options.bloom_filter = true;
    let files = one_column_roundtrip_with_options(options);

    // checks bloom filter with original input
    // fails with message: Value [0, 100, 252, 121, 114, 254, 255, 255] should be in bloom filter
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i64).collect(),
    );

    // convert to "days since epoch"
    let values: Vec<_> = dates
        .iter()
        .map(|d| d.signed_duration_since(epoch).num_days() as i32)
        .collect();

    // passes
    check_bloom_filter(
        files,
        "col".to_string(),
        values,
        (0..SMALL_SIZE as i64).collect(),
    );
}
```

I think the same would apply to decimal and interval types.

## Solution

This only affects scenarios where Parquet was written via `ArrowWriter`; the solution should not impact scenarios where the Parquet reader/writer APIs are used directly.

Since the SBBF is checked by the crate user, part of the solution involves documenting how the Arrow-to-Parquet type coercion affects the physical representation and, therefore, the bloom filter. However, I think we can also make some changes that improve the safety and convenience of working with bloom filters. Some options:

1. Introduce an `arrow_value_to_parquet_value` function for callers to use prior to calling `Sbbf::check`.
2. Introduce a new function, perhaps named `arrow_sbbf_check`, that accepts an Arrow array and a bloom filter. This method would map the Arrow type to the correct Parquet type before checking the bloom filter, returning a `BooleanArray`. (A rough sketch of this idea follows below.)
3. Optionally use the bloom filter in the Arrow reader, performing the same type coercion that occurs in the Arrow writer. This way, Arrow users never need to use the `Sbbf` API directly.

[There is a similar conversation happening regarding the C++ implementation](https://github.com/apache/arrow/issues/33683).
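For illustration, here is a rough sketch of what option 2 could look like. The name `arrow_sbbf_check`, the set of handled types, and the exact coercion rules shown here are assumptions for discussion, not an existing API; a real implementation would need to mirror whatever coercions `ArrowWriter` actually performs.

```rust
// Hypothetical sketch of option 2: check an Arrow array against an SBBF,
// coercing each value to the physical representation the ArrowWriter writes.
// The function and the coercions below are illustrative assumptions.
use arrow_array::{Array, ArrayRef, BooleanArray, Date64Array, Int16Array, Int8Array};
use arrow_schema::DataType;
use parquet::bloom_filter::Sbbf;

fn arrow_sbbf_check(array: &ArrayRef, sbbf: &Sbbf) -> BooleanArray {
    match array.data_type() {
        // Int8/Int16 are widened to Int32 on write, so widen before checking.
        DataType::Int8 => {
            let array = array.as_any().downcast_ref::<Int8Array>().unwrap();
            array.iter().map(|v| v.map(|v| sbbf.check(&(v as i32)))).collect()
        }
        DataType::Int16 => {
            let array = array.as_any().downcast_ref::<Int16Array>().unwrap();
            array.iter().map(|v| v.map(|v| sbbf.check(&(v as i32)))).collect()
        }
        // Date64 (i64 milliseconds since epoch) is written as Date32
        // (i32 days since epoch), so apply the same conversion before checking.
        DataType::Date64 => {
            const MILLIS_PER_DAY: i64 = 86_400_000;
            let array = array.as_any().downcast_ref::<Date64Array>().unwrap();
            array
                .iter()
                .map(|v| v.map(|ms| sbbf.check(&((ms / MILLIS_PER_DAY) as i32))))
                .collect()
        }
        // Decimal, interval, etc. would need the same treatment; omitted here.
        _ => unimplemented!("coercion not covered by this sketch"),
    }
}
```

With something like this, callers would get a `BooleanArray` of membership results without having to know the physical type the writer chose for the column.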
The first two options are simple but still push some complexity to Arrow users. I think the last option is ideal but definitely a much larger effort. That said, these options aren't mutually exclusive. Would love to hear other thoughts on this. :)
