Re: [PR] Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` [arrow-rs]

via GitHub Wed, 17 Jul 2024 13:55:19 -0700


alamb commented on code in PR #6068:
URL: https://github.com/apache/arrow-rs/pull/6068#discussion_r1681702498



##########
parquet/src/column/writer/mod.rs:
##########
@@ -204,6 +204,33 @@ struct ColumnMetrics<T> {
     max_column_value: Option<T>,
     num_column_nulls: u64,
     column_distinct_count: Option<u64>,
+    variable_length_bytes: Option<i64>,
+}
+
+impl<T> ColumnMetrics<T> {
+    fn new() -> Self {

Review Comment:
   We could probably save some typing here by using
   
   ```rust
   #[derive(Default)]
   struct ColumnMetrics<T> {
   ..
   ```
   
   And then 
   
   ```rust
   impl<T> ColumnMetrics<T> {
       fn new() -> Self {
         Default::default()
        }
   }



##########
parquet/src/data_type.rs:
##########
@@ -956,6 +963,10 @@ pub(crate) mod private {
             Ok(num_values)
         }
 
+        fn variable_length_bytes(values: &[Self]) -> Option<i64> {
+            Some(values.iter().map(|x| x.len() as i64).sum())

Review Comment:
   I double checked and this is a `&[ByteArray]` so it is summing up the 
lengths of the buffers, not the lengths of the individual strings 👍 



##########
parquet/src/file/page_index/index_reader.rs:
##########
@@ -81,9 +82,9 @@ pub fn read_columns_indexes<R: ChunkReader>(
 /// Return an empty vector if this row group does not contain an
 /// [`OffsetIndex]`.
 ///
-/// See [Column Index Documentation] for more details.
+/// See [Page Index Documentation] for more details.
 ///
-/// [Column Index Documentation]: 
https://github.com/apache/parquet-format/blob/master/PageIndex.md
+/// [Page Index Documentation]: 
https://github.com/apache/parquet-format/blob/master/PageIndex.md
 pub fn read_pages_locations<R: ChunkReader>(

Review Comment:
   I wonder should we deprecate `read_pages_location` and instead encourage 
people to only use `read_offset_indexes` (which contain the page location too?) 
I think that would be less confusing to me as it took a while to realize that 
PageLocation is stored within the OffsetIndex



##########
parquet/src/file/page_index/index_reader.rs:
##########
@@ -109,7 +146,13 @@ pub fn read_pages_locations<R: ChunkReader>(
         .collect()
 }
 
-pub(crate) fn decode_offset_index(data: &[u8]) -> Result<Vec<PageLocation>, 
ParquetError> {
+pub(crate) fn decode_offset_index(data: &[u8]) -> Result<OffsetSizeIndex, 
ParquetError> {
+    let mut prot = TCompactSliceInputProtocol::new(data);

Review Comment:
   These functions both seem to decode the `OffsetIndex` which I think means it 
happens twice 🤔 



##########
parquet/src/file/metadata/mod.rs:
##########
@@ -179,6 +184,16 @@ impl ParquetMetaData {
         self.offset_index.as_ref()
     }
 
+    /// Returns `unencoded_byte_array_data_bytes` from the offset indexes in 
this file, if loaded

Review Comment:
   I think it would help to add some comments about what this value represents
   
   ```suggestion
       /// Returns `unencoded_byte_array_data_bytes` from the offset indexes in 
this file, if loaded
       /// 
       /// This value represents the output size of the total bytes in this 
file, which can be useful for
       /// allocating an appropriately sized output buffer. 
   ```
   
   Writing the comment makes me think that maybe it would be good if we could 
update the parquet reader to use these values to reserve buffer sizes and avoid 
some copies 🤔 



##########
parquet/src/file/page_index/offset_index.rs:
##########
@@ -0,0 +1,50 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [`OffsetSizeIndex`] structure holding decoded [`OffsetIndex`] information
+
+use crate::errors::ParquetError;
+use crate::format::{OffsetIndex, PageLocation};
+
+/// [`OffsetIndex`] information for a column chunk. Contains offsets and sizes 
for each page
+/// in the chunk. Optionally stores fully decoded page sizes for BYTE_ARRAY 
columns.
+#[derive(Debug, Clone, PartialEq)]
+pub struct OffsetSizeIndex {

Review Comment:
   Why is this called Offset**Size**Index? 
   
   It seems like the corresponding thrift structure is called "OffsetIndex"
   
   
https://github.com/apache/parquet-format/blob/abb92e61e9756016c08401fef729d2ace187d184/src/main/thrift/parquet.thrift#L1031
   
   What do you think about calling it `ParquetOffsetIndex` to follow the 
convention of `ParquetMetaData`  (though I will freely admit the naming of 
these structures makes it very hard to reason about)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` [arrow-rs]

Reply via email to