mr-brobot commented on issue #5550:
URL: https://github.com/apache/arrow-rs/issues/5550#issuecomment-2028482716

   ## Overview
   
   The general issue is that Parquet's types are a subset of Arrow's, so the 
Arrow writer must coerce Arrow values to Parquet types. In some cases this 
coercion changes the physical representation, so a bloom filter check can 
produce a false negative if probed with the original Arrow representation. 
Correctness is only guaranteed when checking with the coerced Parquet value.
   
   For the issue mentioned above, [Parquet only supports `Int32` and `Int64` 
integer types](https://parquet.apache.org/docs/file-format/types/):
   > 16-bit ints are not explicitly supported in the storage format since they 
are covered by 32-bit ints with an efficient encoding
   
   `Int8` and `Int16` Arrow types are widened to `Int32` when writing. For 
correct bloom filter behavior, values must likewise be widened prior to calling 
`Sbbf::check`; widening before checking makes the failing tests mentioned above 
pass.
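   
   For example, here's a minimal sketch of the corrected check for a single 
`Int16` value (this assumes an `Sbbf` already loaded from the file's metadata; 
`contains_i16` is just an illustrative name):
   
   ```rust
   use parquet::bloom_filter::Sbbf;
   
   /// Illustrative helper: probe the bloom filter for an Arrow `Int16` value.
   /// `Sbbf::check` hashes the value's raw bytes, so the 2-byte `i16` must be
   /// widened to the 4-byte `i32` representation the writer actually hashed.
   fn contains_i16(sbbf: &Sbbf, value: i16) -> bool {
       // probing with `&value` directly can produce a false negative because
       // the written column is physically `Int32`
       sbbf.check(&(value as i32))
   }
   ```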
   
   However, this issue is not limited to `Int8` and `Int16`. For example, when 
writing `Date64`, values are coerced to `Date32` and then to `Int32`. This 
changes the physical representation from `Int64` "milliseconds since epoch" to 
`Int32` "days since epoch". 
   
   ```rust
   #[test]
   fn date64_column_bloom_filter() {
       use chrono::NaiveDate;
   
       // construct input data
       let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
       let dates = vec![
           NaiveDate::from_ymd_opt(1915, 11, 25).unwrap(),
           NaiveDate::from_ymd_opt(1964, 4, 14).unwrap(),
           NaiveDate::from_ymd_opt(1992, 9, 2).unwrap(),
           NaiveDate::from_ymd_opt(2024, 1, 1).unwrap(),
       ];
       let values: Vec<_> = dates
           .iter()
           .map(|d| d.signed_duration_since(epoch).num_milliseconds())
           .collect();
   
       // write (asserts written values match input)
       let array = Arc::new(Date64Array::from(values.clone()));
       let mut options = RoundTripOptions::new(array, false);
       options.bloom_filter = true;
       let files = one_column_roundtrip_with_options(options);
   
       // checking the bloom filter with the original "milliseconds since epoch"
       // values produces a false negative and panics with:
       //     Value [0, 100, 252, 121, 114, 254, 255, 255] should be in bloom filter
       // check_bloom_filter(
       //     files,
       //     "col".to_string(),
       //     values,
       //     (0..SMALL_SIZE as i64).collect(),
       // );
   
       // convert to the coerced representation, "days since epoch"
       let values: Vec<_> = dates
           .iter()
           .map(|d| d.signed_duration_since(epoch).num_days() as i32)
           .collect();
   
       // passes when probed with the coerced values
       check_bloom_filter(
           files,
           "col".to_string(),
           values,
           (0..SMALL_SIZE as i32).collect(),
       );
   }
   
   ```
   
   I think the same would apply to decimal and interval types, which can also be 
written with a different physical representation than their Arrow form.
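   
   For example, assuming the writer maps a `Decimal128` with precision <= 9 to 
the Parquet `INT32` physical type (which the format permits), probing with the 
original Arrow value would hash a different byte sequence than the one the 
writer hashed:
   
   ```rust
   fn main() {
       // the unscaled value of 123.45 as a Decimal128(5, 2)
       let unscaled: i128 = 12_345;
   
       // Arrow holds a 16-byte i128, so a probe with the original
       // value hashes 16 bytes...
       assert_eq!(unscaled.to_le_bytes().len(), 16);
   
       // ...but, under the assumed INT32 mapping, the writer hashed only 4
       assert_eq!((unscaled as i32).to_le_bytes().len(), 4);
   }
   ```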
   
   
   ## Solution
   
   This only affects scenarios where the Parquet file was written via 
`ArrowWriter`. The solution should not impact scenarios where the Parquet 
reader/writer APIs are used directly.
   
   Since the SBBF is checked by the crate user, part of the solution involves 
documenting how the Arrow-Parquet type coercion affects the physical 
representation and, therefore, the bloom filter. However, I think we can make 
some changes to improve the safety/convenience of working with bloom filters. 
Some options:
   1. Introduce an `arrow_value_to_parquet_value` function for callers to use 
prior to calling `Sbbf::check`. 
   2. Introduce a new function, perhaps named `arrow_sbbf_check`, that accepts 
an Arrow array and a bloom filter, maps the Arrow type to the correct Parquet 
type before checking, and returns a `BooleanArray` of results (see the sketch 
after this list).
   3. Optionally use the bloom filter in the Arrow reader, performing the same 
type coercion that occurs in the Arrow writer. This way, Arrow users never 
need to use the `Sbbf` API directly. [There is a similar conversation happening 
regarding the C++ implementation](https://github.com/apache/arrow/issues/33683).
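   
   To make the second option concrete, here's a rough sketch of what 
`arrow_sbbf_check` could look like, covering only two of the affected types 
(the exact coercions would need to mirror the casts the writer performs; 
flooring division is used for `Date64` here only for illustration):
   
   ```rust
   use arrow_array::{Array, BooleanArray, Date64Array, Int16Array};
   use arrow_schema::DataType;
   use parquet::bloom_filter::Sbbf;
   
   /// Rough sketch of option 2: probe the SBBF with the coerced Parquet
   /// representation of each Arrow value, returning one result per row.
   fn arrow_sbbf_check(array: &dyn Array, sbbf: &Sbbf) -> BooleanArray {
       match array.data_type() {
           // Int16 is widened to the Parquet Int32 physical type
           DataType::Int16 => {
               let array = array.as_any().downcast_ref::<Int16Array>().unwrap();
               array
                   .iter()
                   .map(|v| v.map(|v| sbbf.check(&(v as i32))))
                   .collect()
           }
           // Date64 ("milliseconds since epoch") is coerced to Date32
           // ("days since epoch"), stored as Parquet Int32
           DataType::Date64 => {
               let array = array.as_any().downcast_ref::<Date64Array>().unwrap();
               array
                   .iter()
                   .map(|v| v.map(|ms| sbbf.check(&(ms.div_euclid(86_400_000) as i32))))
                   .collect()
           }
           _ => unimplemented!("remaining coercions omitted from this sketch"),
       }
   }
   ```
   
   Nulls map to nulls in the output `BooleanArray`, matching the usual Arrow 
kernel semantics.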
   
   The first two options are simple but still push some complexity onto Arrow 
users. I think the last option is ideal, but it's definitely a much larger 
effort. That said, these options aren't mutually exclusive. Would love to hear 
other thoughts on this. :)

