alamb commented on code in PR #6068: URL: https://github.com/apache/arrow-rs/pull/6068#discussion_r1681702498
########## parquet/src/column/writer/mod.rs: ########## @@ -204,6 +204,33 @@ struct ColumnMetrics<T> { max_column_value: Option<T>, num_column_nulls: u64, column_distinct_count: Option<u64>, + variable_length_bytes: Option<i64>, +} + +impl<T> ColumnMetrics<T> { + fn new() -> Self { Review Comment: We could probably save some typing here by using ```rust #[derive(Default)] struct ColumnMetrics<T> { .. ``` And then ```rust impl<T> ColumnMetrics<T> { fn new() -> Self { Default::default() } } ########## parquet/src/data_type.rs: ########## @@ -956,6 +963,10 @@ pub(crate) mod private { Ok(num_values) } + fn variable_length_bytes(values: &[Self]) -> Option<i64> { + Some(values.iter().map(|x| x.len() as i64).sum()) Review Comment: I double checked and this is a `&[ByteArray]` so it is summing up the lengths of the buffers, not the lengths of the individual strings 👍 ########## parquet/src/file/page_index/index_reader.rs: ########## @@ -81,9 +82,9 @@ pub fn read_columns_indexes<R: ChunkReader>( /// Return an empty vector if this row group does not contain an /// [`OffsetIndex]`. /// -/// See [Column Index Documentation] for more details. +/// See [Page Index Documentation] for more details. /// -/// [Column Index Documentation]: https://github.com/apache/parquet-format/blob/master/PageIndex.md +/// [Page Index Documentation]: https://github.com/apache/parquet-format/blob/master/PageIndex.md pub fn read_pages_locations<R: ChunkReader>( Review Comment: I wonder should we deprecate `read_pages_location` and instead encourage people to only use `read_offset_indexes` (which contain the page location too?) I think that would be less confusing to me as it took a while to realize that PageLocation is stored within the OffsetIndex ########## parquet/src/file/page_index/index_reader.rs: ########## @@ -109,7 +146,13 @@ pub fn read_pages_locations<R: ChunkReader>( .collect() } -pub(crate) fn decode_offset_index(data: &[u8]) -> Result<Vec<PageLocation>, ParquetError> { +pub(crate) fn decode_offset_index(data: &[u8]) -> Result<OffsetSizeIndex, ParquetError> { + let mut prot = TCompactSliceInputProtocol::new(data); Review Comment: These functions both seem to decode the `OffsetIndex` which I think means it happens twice 🤔 ########## parquet/src/file/metadata/mod.rs: ########## @@ -179,6 +184,16 @@ impl ParquetMetaData { self.offset_index.as_ref() } + /// Returns `unencoded_byte_array_data_bytes` from the offset indexes in this file, if loaded Review Comment: I think it would help to add some comments about what this value represents ```suggestion /// Returns `unencoded_byte_array_data_bytes` from the offset indexes in this file, if loaded /// /// This value represents the output size of the total bytes in this file, which can be useful for /// allocating an appropriately sized output buffer. ``` Writing the comment makes me think that maybe it would be good if we could update the parquet reader to use these values to reserve buffer sizes and avoid some copies 🤔 ########## parquet/src/file/page_index/offset_index.rs: ########## @@ -0,0 +1,50 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! [`OffsetSizeIndex`] structure holding decoded [`OffsetIndex`] information + +use crate::errors::ParquetError; +use crate::format::{OffsetIndex, PageLocation}; + +/// [`OffsetIndex`] information for a column chunk. Contains offsets and sizes for each page +/// in the chunk. Optionally stores fully decoded page sizes for BYTE_ARRAY columns. +#[derive(Debug, Clone, PartialEq)] +pub struct OffsetSizeIndex { Review Comment: Why is this called Offset**Size**Index? It seems like the corresponding thrift structure is called "OffsetIndex" https://github.com/apache/parquet-format/blob/abb92e61e9756016c08401fef729d2ace187d184/src/main/thrift/parquet.thrift#L1031 What do you think about calling it `ParquetOffsetIndex` to follow the convention of `ParquetMetaData` (though I will freely admit the naming of these structures makes it very hard to reason about) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org