etseidl commented on code in PR #8797:
URL: https://github.com/apache/arrow-rs/pull/8797#discussion_r2515260496
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -1050,12 +1051,43 @@ impl ColumnChunkMetaData {
self.geo_statistics.as_deref()
}
- /// Returns the offset for the page encoding stats,
- /// or `None` if no page encoding stats are available.
+ /// Returns the page encoding statistics, or `None` if no page encoding statistics
+ /// are available.
pub fn page_encoding_stats(&self) -> Option<&Vec<PageEncodingStats>> {
self.encoding_stats.as_ref()
}
+ /// Returns the page encoding statistics reduced to a bitmask, or `None` if statistics are
+ /// not available.
+ ///
+ /// The [`PageEncodingStats`] struct was added to the Parquet specification specifically to
+ /// enable fast determination of whether all pages in a column chunk are dictionary encoded
+ /// (see <https://github.com/apache/parquet-format/pull/16>).
+ /// Decoding the full page encoding statistics, however, can be very costly, and is not
+ /// necessary to support the aforementioned use case. As an alternative, this crate can
+ /// instead distill the list of `PageEncodingStats` down to a bitmask of just the encodings
+ /// used for data pages
+ /// (see [`ParquetMetaDataOptions::set_encoding_stats_as_mask`]).
+ /// To test for an all-dictionary-encoded chunk one could use this bitmask in the following way:
+ ///
+ /// ```rust
+ /// use parquet::basic::Encoding;
+ /// use parquet::file::metadata::ColumnChunkMetaData;
+ /// // test if all data pages in the column chunk are dictionary encoded
+ /// fn is_all_dictionary_encoded(col_meta: &ColumnChunkMetaData) -> bool {
+ ///     // check that dictionary encoding was used
+ ///     col_meta.dictionary_page_offset().is_some()
+ ///         && col_meta.page_encoding_stats_mask().is_some_and(|mask| {
+ ///             // mask should only have one bit set, either for PLAIN_DICTIONARY or
+ ///             // RLE_DICTIONARY
+ ///             mask.is_only(Encoding::PLAIN_DICTIONARY) || mask.is_only(Encoding::RLE_DICTIONARY)
+ ///         })
+ /// }
+ /// ```
+ pub fn page_encoding_stats_mask(&self) -> Option<&EncodingMask> {
Review Comment:
I wonder if this should be `data_page_encoding_stats_mask` (or just
`data_page_encoding_stats`) to make it clear it only has the stats for data
pages.
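
For illustration, the data-page-only semantics behind the suggested name could be sketched roughly as follows. This is a minimal, self-contained sketch under stated assumptions: `DataPageEncodingMask`, its `insert` method, and the local `Encoding` enum are hypothetical stand-ins, not the crate's `EncodingMask` or `parquet::basic::Encoding`. Only the `is_only` name mirrors the doc example in the diff above.

```rust
// Hypothetical stand-in for `parquet::basic::Encoding`; the discriminants
// follow the Parquet format's encoding ids for these three variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Encoding {
    Plain = 0,
    PlainDictionary = 2,
    RleDictionary = 8,
}

/// Hypothetical mask: one bit per encoding, set when at least one
/// data page in the column chunk used that encoding.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct DataPageEncodingMask(u32);

impl DataPageEncodingMask {
    /// Record that some data page used `enc`.
    fn insert(&mut self, enc: Encoding) {
        self.0 |= 1 << (enc as u32);
    }

    /// True when `enc` is the only encoding recorded in the mask.
    fn is_only(&self, enc: Encoding) -> bool {
        self.0 == 1 << (enc as u32)
    }
}

/// True when every data page recorded in the mask is dictionary encoded.
fn all_data_pages_dictionary_encoded(mask: DataPageEncodingMask) -> bool {
    mask.is_only(Encoding::PlainDictionary) || mask.is_only(Encoding::RleDictionary)
}
```

Because the mask in this sketch only ever records data-page encodings, an empty mask or one that also contains `Plain` fails the check, which is the behavior a `data_page_` prefix would advertise.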