alamb commented on code in PR #8587:
URL: https://github.com/apache/arrow-rs/pull/8587#discussion_r2421667109
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -1148,7 +1155,7 @@ impl ColumnChunkMetaDataBuilder {
fn new(column_descr: ColumnDescPtr) -> Self {
Self(ColumnChunkMetaData {
column_descr,
- encodings: Vec::new(),
+ encodings: 0,
Review Comment:
No allocations

##########
parquet/src/basic.rs:
##########
@@ -724,6 +724,58 @@ impl FromStr for Encoding {
}
}
+const MAX_ENCODING: i32 = Encoding::BYTE_STREAM_SPLIT as i32;
+
+/// Given a Thrift input stream, read a vector of [`Encoding`] enums and
convert to a bitmask.
+pub(super) fn thrift_encodings_to_mask<'a, R: ThriftCompactInputProtocol<'a>>(
+ prot: &mut R,
+) -> Result<i32> {
+ let mut mask = 0;
+
+ let list_ident = prot.read_list_begin()?;
+ for _ in 0..list_ident.size {
+ let val = i32::read_thrift(prot)?;
+ if (0..=MAX_ENCODING).contains(&val) {
+ mask |= 1 << val;
+ }
+ }
+ Ok(mask)
+}
+
+#[allow(deprecated)]
Review Comment:
what is deprecated about this?
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -968,8 +970,13 @@ impl ColumnChunkMetaData {
}
/// All encodings used for this column.
- pub fn encodings(&self) -> &Vec<Encoding> {
- &self.encodings
+ pub fn encodings(&self) -> Vec<Encoding> {
Review Comment:
I also recommend changing the signature to avoid *forcing* an allocation --
for example how about returning an iterator like this?
```suggestion
pub fn encodings(&self) -> impl Iterator<Item = &Encoding> {
```
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -968,8 +970,13 @@ impl ColumnChunkMetaData {
}
/// All encodings used for this column.
- pub fn encodings(&self) -> &Vec<Encoding> {
- &self.encodings
+ pub fn encodings(&self) -> Vec<Encoding> {
+ mask_to_encodings_vec(self.encodings)
+ }
+
+ /// All encodings used for this column, returned as a bitmask.
+ pub fn encodings_mask(&self) -> i32 {
Review Comment:
I think this will be very hard to use as a user given the lack of
documentation / export of how to interpret the values.
I left a suggestion above about how to improve it
##########
parquet/src/file/metadata/encryption.rs:
##########
@@ -0,0 +1,544 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// a collection of generated structs used to parse thrift metadata
Review Comment:
technically speaking they aren't generated anymore
##########
parquet/src/basic.rs:
##########
@@ -724,6 +724,58 @@ impl FromStr for Encoding {
}
}
+const MAX_ENCODING: i32 = Encoding::BYTE_STREAM_SPLIT as i32;
Review Comment:
It is probably my OOP / old age, but I would find this API easier to deal
with if it were wrapped in a type that had a documented API
Something like
```rust
struct EncodingMask(i32);
impl EncodingMask {
fn new(val: i32) -> Self {
...
}
fn as_i32(&self) -> i32 {
self.0
}
}
// then you can implement From impls to/from i32 as well as a no allocation
iterator
```
I am happy to help hack this out if you like the idea
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -968,8 +970,13 @@ impl ColumnChunkMetaData {
}
/// All encodings used for this column.
- pub fn encodings(&self) -> &Vec<Encoding> {
- &self.encodings
+ pub fn encodings(&self) -> Vec<Encoding> {
Review Comment:
How about we also mark this function as `#[deprecated]` to help people know
they should use the more efficient `encoding_mask` if they are using it?
##########
parquet/src/file/metadata/encryption.rs:
##########
Review Comment:
Since this is thrift parsing related, perhaps it would be better organized
as a submodule of thrift_get?
In other words, how about renaming it into
`parquet/src/file/metadata/thrift_gen/encryption.rs`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]