sahuagin opened a new pull request, #9770:
URL: https://github.com/apache/arrow-rs/pull/9770

     Which issue does this PR close?
   
     Depends on #9769 (the decoder PR); this diff includes those changes while 
that PR is pending.
   
     Rationale for this change
   
     Exposes the scan_filtered miniblock-level predicate pushdown added in 
#9769 through the full column reader stack and as a public API. For mandatory 
DELTA_BINARY_PACKED INT32/INT64 columns, entire miniblocks (32/64 values) can 
be skipped without decoding when a caller-supplied range predicate rules them 
out. This is especially effective for monotone columns (timestamps, sequence 
numbers, auto-increment IDs), where miniblocks with bit width 0 (bw=0) allow 
O(1) skipping.
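
     How that skipping can work, sketched below. DELTA_BINARY_PACKED block 
headers carry min_delta and per-miniblock bit widths, so a conservative value 
range is available before any value is decoded. The function name, signature, 
and exact bounds here are illustrative assumptions, not the arrow-rs internals:

```rust
// Hedged sketch: a conservative [lo, hi] value range for one
// DELTA_BINARY_PACKED miniblock, computed only from header data
// (min_delta, bit width) plus the last value decoded before the
// miniblock. All names here are illustrative, not arrow-rs internals.
fn miniblock_range(prev: i64, min_delta: i64, bit_width: u32, n: i64) -> (i64, i64) {
    // Each encoded delta is min_delta + packed, with packed in [0, 2^bit_width - 1].
    let max_extra = if bit_width == 0 { 0 } else { (1i64 << bit_width) - 1 };
    let d_min = min_delta;             // smallest possible per-step delta
    let d_max = min_delta + max_extra; // largest possible per-step delta
    // Values are prev plus cumulative deltas; the extremes occur after
    // either the first or the last of the n steps.
    let lo = prev + if d_min < 0 { n * d_min } else { d_min };
    let hi = prev + if d_max > 0 { n * d_max } else { d_max };
    (lo, hi)
}

fn main() {
    // Bit width 0: every delta equals min_delta, so the range is exact
    // and the whole miniblock can be accepted or skipped in O(1).
    assert_eq!(miniblock_range(100, 1, 0, 32), (101, 132));
    // Nonzero bit width: the bounds are conservative (never too narrow).
    assert_eq!(miniblock_range(100, -1, 2, 4), (96, 108));
}
```

     With bit width 0 the bounds collapse to the exact range, which is why 
monotone columns benefit most.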
   
     What changes are included in this PR?
   
     Wiring chain (bottom to top):
     - ColumnValueDecoderImpl::scan_filtered_values() — dispatches to decoder
     - GenericColumnReader::scan_filtered_records() — mandatory columns only; 
optional/repeated fall back to full decode to keep def/rep levels in sync
     - GenericRecordReader::scan_filtered_records() — page-switching loop with 
same fallback
     - ArrayReader::scan_records() — new provided trait method (default = 
read_records, safe for all encodings/types)
     - PrimitiveArrayReader::scan_records() — override that calls the above 
chain
     - StructArrayReader::scan_records() — delegates to its single child for 
single-column projections; multi-column projections fall back to read_records
     - ParquetRecordBatchReader::next_inner — uses scan_records in the All 
selection branch when a predicate is set
     - ArrowReaderBuilder::with_miniblock_predicate() — fluent setter; works 
for both ParquetRecordBatchReaderBuilder (sync) and 
ParquetRecordBatchStreamBuilder (async)
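
     The provided-default pattern on ArrayReader mentioned above can be 
sketched as follows; the signatures are simplified stand-ins, not the real 
trait, which takes readers and returns Result:

```rust
// Hedged sketch of the provided-default pattern: scan_records falls back
// to read_records unless a reader opts in, so every existing implementation
// compiles and behaves unchanged. Signatures are simplified for illustration.
trait ArrayReader {
    fn read_records(&mut self, batch_size: usize) -> usize;

    // Provided default: safe for all encodings/types.
    fn scan_records(&mut self, batch_size: usize) -> usize {
        self.read_records(batch_size)
    }
}

// An existing implementation that predates scan_records.
struct LegacyReader;

impl ArrayReader for LegacyReader {
    fn read_records(&mut self, batch_size: usize) -> usize {
        batch_size // pretend we decoded everything requested
    }
}

fn main() {
    let mut r = LegacyReader;
    // The default makes scan_records behave exactly like read_records.
    let scanned = r.scan_records(1024);
    let read = r.read_records(1024);
    assert_eq!(scanned, read);
}
```
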
   
     New public API:
     pub type MiniblockPredicate = Arc<dyn Fn(i64, i64) -> bool + Send + Sync>;
   
     // on ArrowReaderBuilder<T>:
     pub fn with_miniblock_predicate(self, predicate: MiniblockPredicate) -> Self;
   
     Example:
     let pred: MiniblockPredicate = Arc::new(|_lo, hi| hi >= 1_000_000);
     let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
         .with_projection(mask)
         .with_miniblock_predicate(pred)
         .build()?;
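
     The contract implied by the example: the predicate receives a 
miniblock's conservative (lo, hi) value range and returns true if the 
miniblock may contain matching rows. A standalone illustration of that 
contract (the skip decision itself lives inside the decoder from #9769):

```rust
use std::sync::Arc;

// Same alias as the new public API.
pub type MiniblockPredicate = Arc<dyn Fn(i64, i64) -> bool + Send + Sync>;

fn main() {
    // Keep only miniblocks that may contain values >= 1_000_000.
    let pred: MiniblockPredicate = Arc::new(|_lo, hi| hi >= 1_000_000);

    // A miniblock whose range tops out below the threshold is skipped
    // without decoding; one that crosses the threshold must be decoded.
    assert!(!pred(500, 900));          // entirely below: skip
    assert!(pred(999_000, 1_000_500)); // straddles threshold: decode
}
```
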
   
     Limitations (documented in MiniblockPredicate rustdoc):
     - Multi-column projections fall back to full decode (each column has 
independent value ranges)
     - Optional/repeated columns fall back (def/rep level synchronization 
required)
     - No file format changes; miniblock ranges are computed on-the-fly from 
data already present in DELTA_BINARY_PACKED block headers
   
     Are these changes tested?
   
     - test_scan_records_delta_binary_packed_mandatory in primitive_array.rs: 
unit test exercising the full wiring chain on a mandatory INT64 DELTA column, 
verifying correct miniblock-level skipping and no false negatives
     - test_with_miniblock_predicate_single_column in arrow_reader/mod.rs: 
end-to-end test through ParquetRecordBatchReaderBuilder verifying the public 
API, correct output, and that fewer rows are returned than a full read
   
     Are there any user-facing changes?
   
     Yes — two additive public API items:
     - MiniblockPredicate type alias
     - ArrowReaderBuilder::with_miniblock_predicate() method
   
     No breaking changes. The new scan_records method on the ArrayReader trait 
has a provided default (read_records) so no existing implementations are 
affected.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
