mr-brobot commented on code in PR #8551:
URL: https://github.com/apache/arrow-rs/pull/8551#discussion_r2481905598
########## parquet/src/arrow/bloom_filter.rs: ##########
@@ -0,0 +1,667 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Arrow-aware bloom filter
+//!
+//! This module provides [`ArrowSbbf`], a wrapper around [`Sbbf`] that handles
+//! Arrow-to-Parquet type coercion when checking bloom filters.
+//!
+//! # Overview
+//!
+//! Arrow supports types like `Int8` and `Int16` that have no corresponding Parquet type.
+//! Data of these types must be coerced to the appropriate Parquet type when writing.
+//!
+//! For example, Arrow small integer types are coerced when writing to Parquet:
+//! - `Int8` → `INT32` (i8 → i32)
+//! - `Int16` → `INT32` (i16 → i32)
+//! - `UInt8` → `INT32` (u8 → u32 → i32)
+//! - `UInt16` → `INT32` (u16 → u32 → i32)
+//!
+//! [`ArrowSbbf`] wraps an [`Sbbf`] and an Arrow [`arrow_schema::DataType`], automatically coercing
+//! values to their Parquet representation before checking the bloom filter.
+//!
+//! # Example
+//!
+//! ```ignore
+//! use arrow_schema::DataType;
+//! use parquet::arrow::bloom_filter::ArrowSbbf;
+//!
+//! // Get bloom filter from row group reader
+//! let sbbf = row_group_reader.get_column_bloom_filter(0)?;
+//!
+//! // Create ArrowSbbf with the logical Arrow type
+//! let arrow_sbbf = ArrowSbbf::new(&sbbf, &DataType::Int8);
+//!
+//! // Check with i8 value - automatically coerced to i32
+//! let value: i8 = 42;
+//! if arrow_sbbf.check(&value) {
+//!     println!("Value might be present");
+//! }
+//! ```
+//!
+//! # Date64 with Type Coercion
+//!
+//! `Date64` requires special handling because it can be stored as either INT32 or INT64
+//! depending on [`WriterProperties::set_coerce_types`].
+//!
+//! When `coerce_types=true`, Date64 is stored as Date32 (INT32, days since epoch).
+//! Users must convert their Date64 values before checking:
+//!
+//! ```ignore
+//! use arrow_array::temporal_conversions::MILLISECONDS_IN_DAY;
+//! use arrow_array::Date64Array;
+//! use arrow_cast::cast;
+//! use arrow_schema::DataType;
+//!
+//! let date64_value = 864_000_000_i64; // milliseconds
+//!
+//! // Check the column's physical type to determine conversion
+//! let column_chunk = row_group_reader.metadata().column(0);
+//! match column_chunk.column_type() {
+//!     ParquetType::INT32 => {
+//!         // Date64 was coerced to Date32 - convert milliseconds to days
+//!         let date32_value = (date64_value / MILLISECONDS_IN_DAY) as i32;
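The quoted hunk is cut off in the middle of the `match`. Purely as an illustration (this is not the PR's code), here is a minimal sketch of how the branch on the column's physical type could look; `date64_might_contain` is a hypothetical helper name, and it assumes that an uncoerced Date64 column stores the raw millisecond value as INT64:

```rust
// Sketch only: probe an Sbbf for a Date64 value, converting it to match
// the physical representation actually written to the column.
use arrow_array::temporal_conversions::MILLISECONDS_IN_DAY;
use parquet::basic::Type as ParquetType;
use parquet::bloom_filter::Sbbf;

/// Returns true if a Date64 value (milliseconds since epoch) *might* be present.
fn date64_might_contain(sbbf: &Sbbf, physical_type: ParquetType, date64_value: i64) -> bool {
    match physical_type {
        // coerce_types=true: stored as INT32 days since epoch, so convert ms -> days
        ParquetType::INT32 => sbbf.check(&((date64_value / MILLISECONDS_IN_DAY) as i32)),
        // coerce_types=false (assumed): stored as INT64 milliseconds, probe the raw value
        ParquetType::INT64 => sbbf.check(&date64_value),
        // Any other physical type: conservatively report "maybe present"
        _ => true,
    }
}
```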
Review Comment:
   I do prefer the ergonomics of an array kernel. This applies nicely to DataFusion, which interacts with bloom filters exclusively via [`BloomFilterStatistics::contained`](https://github.com/apache/datafusion/blob/f57da83aac35f0ea4506ccb6a4ddbd26a503c1c1/datafusion/datasource-parquet/src/row_group_filter.rs#L349-L354). Perhaps I can implement it as an array kernel and benchmark it, and then we can decide from there?
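For context, a rough sketch of the "array kernel" shape being discussed follows. It is only an illustration of the idea, not code from this PR or from DataFusion: `sbbf_contains` is a hypothetical name, only two input types are handled, and small integers are widened to `i32` before probing to mirror the writer coercion described in the module docs above.

```rust
// Sketch: probe the filter with every value in an Arrow array and return a
// BooleanArray of "might be present" flags (nulls stay null).
use arrow_array::cast::AsArray;
use arrow_array::types::{Int32Type, Int8Type};
use arrow_array::{Array, BooleanArray};
use arrow_schema::{ArrowError, DataType};
use parquet::bloom_filter::Sbbf;

fn sbbf_contains(sbbf: &Sbbf, values: &dyn Array) -> Result<BooleanArray, ArrowError> {
    match values.data_type() {
        // Int8 has no Parquet physical type: widen each value to i32 before probing,
        // matching how the writer coerces Int8 -> INT32.
        DataType::Int8 => {
            let values = values.as_primitive::<Int8Type>();
            Ok(values
                .iter()
                .map(|v| v.map(|v| sbbf.check(&(v as i32))))
                .collect())
        }
        // Int32 maps directly to INT32, so probe with the value as-is.
        DataType::Int32 => {
            let values = values.as_primitive::<Int32Type>();
            Ok(values.iter().map(|v| v.map(|v| sbbf.check(&v))).collect())
        }
        other => Err(ArrowError::NotYetImplemented(format!(
            "bloom filter kernel not implemented for {other}"
        ))),
    }
}
```

A kernel with this shape could let a caller such as `BloomFilterStatistics::contained` probe a whole batch of literals in one call instead of coercing and checking scalar by scalar, which is the ergonomic and performance question the benchmark would settle.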
