alamb commented on code in PR #8797: URL: https://github.com/apache/arrow-rs/pull/8797#discussion_r2534131627
########## parquet/src/file/metadata/options.rs: ########## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Options used to control metadata parsing + +use paste::paste; +use std::collections::HashSet; +use std::sync::Arc; + +use crate::schema::types::SchemaDescPtr; + +/// Options that can be set to control what parts of the Parquet file footer +/// metadata will be decoded and made present in the [`ParquetMetaData`] returned +/// by [`ParquetMetaDataReader`] and [`ParquetMetaDataPushDecoder`]. +/// +/// [`ParquetMetaData`]: crate::file::metadata::ParquetMetaData +/// [`ParquetMetaDataReader`]: crate::file::metadata::ParquetMetaDataReader +/// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder +#[derive(Default, Debug, Clone)] +pub struct ParquetMetaDataOptions { + schema_descr: Option<SchemaDescPtr>, + encoding_stats_as_mask: bool, Review Comment: I personally suggest we: file a ticket / PR to change the default in the next major release ########## parquet/src/file/metadata/options.rs: ########## @@ -29,6 +33,24 @@ use crate::schema::types::SchemaDescPtr; #[derive(Default, Debug, Clone)] pub struct ParquetMetaDataOptions { schema_descr: Option<SchemaDescPtr>, + encoding_stats_as_mask: bool, Review Comment: It took me a little while to grok how encoding_stats_as_mask and skip_encoding_stats were related. But now I see they are mutually exclusive I think maybe creating a enum rather than Option / Option /etc would be clearer to me but since it is an implementation detail this is also find ########## parquet/src/file/metadata/options.rs: ########## @@ -0,0 +1,191 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Options used to control metadata parsing + +use paste::paste; +use std::collections::HashSet; +use std::sync::Arc; + +use crate::schema::types::SchemaDescPtr; + +/// Options that can be set to control what parts of the Parquet file footer +/// metadata will be decoded and made present in the [`ParquetMetaData`] returned +/// by [`ParquetMetaDataReader`] and [`ParquetMetaDataPushDecoder`]. +/// +/// [`ParquetMetaData`]: crate::file::metadata::ParquetMetaData +/// [`ParquetMetaDataReader`]: crate::file::metadata::ParquetMetaDataReader +/// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder +#[derive(Default, Debug, Clone)] +pub struct ParquetMetaDataOptions { + schema_descr: Option<SchemaDescPtr>, + encoding_stats_as_mask: bool, + // The outer option acts as a global boolean, so if `skip_encoding_stats.is_some()` + // is `true` then we're at least skipping some stats. The inner `Option` is a keep + // list of column indicies to decode. + skip_encoding_stats: Option<Option<Arc<HashSet<usize>>>>, Review Comment: Maybe a bitmask would be the better (but that can be a subsequent PR perhaps) ########## parquet/src/file/metadata/options.rs: ########## @@ -48,11 +70,70 @@ impl ParquetMetaDataOptions { self.schema_descr = Some(val); } - /// Provide a schema to use when decoding the metadata. Returns `Self` for chaining. - pub fn with_schema(mut self, val: SchemaDescPtr) -> Self { - self.schema_descr = Some(val); - self + // with_schema + add_mutator!(schema, SchemaDescPtr); + + /// Returns whether to present the `encoding_stats` field of the `ColumnMetaData` as a + /// bitmask. + /// + /// See [`ColumnChunkMetaData::page_encoding_stats_mask`] for an explanation of why this + /// might be desirable. + /// + /// [`ColumnChunkMetaData::page_encoding_stats_mask`]: + /// crate::file::metadata::ColumnChunkMetaData::page_encoding_stats_mask + pub fn encoding_stats_as_mask(&self) -> bool { + self.encoding_stats_as_mask + } + + /// Convert `encoding_stats` from a vector of [`PageEncodingStats`] to a bitmask. This can + /// speed up metadata decoding while still enabling some use cases served by the full stats. + /// + /// See [`ColumnChunkMetaData::page_encoding_stats_mask`] for more information. + /// + /// [`PageEncodingStats`]: crate::file::metadata::PageEncodingStats + /// [`ColumnChunkMetaData::page_encoding_stats_mask`]: + /// crate::file::metadata::ColumnChunkMetaData::page_encoding_stats_mask + pub fn set_encoding_stats_as_mask(&mut self, val: bool) { + self.encoding_stats_as_mask = val; } + + // with_encoding_stats_as_mask + add_mutator!(encoding_stats_as_mask, bool); Review Comment: > Now that I've done the macro, it saves all of 3 lines per instance. I'm fine with doing away with it. I thought it was pretty clear ########## parquet/src/file/metadata/mod.rs: ########## @@ -1274,7 +1321,13 @@ impl ColumnChunkMetaDataBuilder { /// Sets page encoding stats for this column chunk. pub fn set_page_encoding_stats(mut self, value: Vec<PageEncodingStats>) -> Self { - self.0.encoding_stats = Some(value); + self.0.encoding_stats = Some(ParquetPageEncodingStats::Full(value)); Review Comment: It might be nice here in the comments to call out that setting the stats will override and mask that was set, and vice versa. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
