alamb commented on code in PR #8476:
URL: https://github.com/apache/arrow-rs/pull/8476#discussion_r2388876196
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -17,9 +17,7 @@
//! Parquet metadata API
//!
-//! Most users should use these structures to interact with Parquet metadata.
-//! The [crate::format] module contains lower level structures generated from
the
-//! Parquet thrift definition.
+//! Users should use these structures to interact with Parquet metadata.
Review Comment:
🎉
##########
parquet/src/file/metadata/thrift_gen.rs:
##########
@@ -1490,9 +1500,75 @@ impl WriteThrift for ColumnChunkMetaData {
}
#[cfg(test)]
-mod tests {
- use crate::file::metadata::thrift_gen::BoundingBox;
+pub(crate) mod tests {
Review Comment:
Somewhat unrelated to this PR, but I noticed the name `thrift_gen` in the
path `parquet/src/file/metadata/thrift_gen.rs` is no longer relevant probably
##########
parquet/src/file/page_index/column_index.rs:
##########
@@ -17,7 +17,7 @@
//! [`ColumnIndexMetaData`] structures holding decoded [`ColumnIndex`]
information
//!
-//! [`ColumnIndex`]: crate::format::ColumnIndex
+//! [`ColumnIndex`]:
https://github.com/apache/parquet-format/blob/master/PageIndex.md
Review Comment:
ditto here -- let's link to the rust struct (and maybe we can link the rust
struct to the parquet format docs)
My rationale is that people looking at the Rust API will want to link to the
Rust code for manipulating these objects, not the original parquet spec as
these links are meant to help navigate this codebase
https://github.com/apache/arrow-rs/blob/e930492306b336d32742797987b680dd1a1120b4/parquet/src/format.rs#L5255-L5254
Similarly for the other links below
##########
parquet/src/file/metadata/writer.rs:
##########
@@ -317,7 +317,7 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
/// 4. Length of encoded `FileMetaData` (4 bytes, little endian)
/// 5. Parquet Magic Bytes (4 bytes)
///
-/// [`FileMetaData`]: crate::format::FileMetaData
+/// [`FileMetaData`]:
https://github.com/apache/parquet-format/tree/master?tab=readme-ov-file#metadata
Review Comment:
I think it is better here to link to the file metadata struct in the crate,
which still seems to exist:
https://github.com/apache/arrow-rs/blob/653fa1a0b2a5a13d4392775f15b5866a83605811/parquet/src/file/metadata/mod.rs#L463-L462
##########
parquet/src/parquet_thrift.rs:
##########
@@ -212,10 +212,10 @@ pub(crate) trait ThriftCompactInputProtocol<'a> {
loop {
let byte = self.read_byte()?;
in_progress |= ((byte & 0x7F) as u64).wrapping_shl(shift);
- shift += 7;
if byte & 0x80 == 0 {
return Ok(in_progress);
}
+ shift += 7;
Review Comment:
is this a bug fix?
##########
parquet/src/file/metadata/mod.rs:
##########
@@ -2100,20 +1824,18 @@ mod tests {
offset_index.append_row_count(1);
offset_index.append_offset_and_size(2, 3);
offset_index.append_unencoded_byte_array_data_bytes(Some(10));
- let offset_index = offset_index.build_to_thrift();
+ let offset_index = offset_index.build();
let parquet_meta = ParquetMetaDataBuilder::new(file_metadata)
.set_row_groups(row_group_meta)
.set_column_index(Some(vec![vec![ColumnIndexMetaData::BOOLEAN(native_index)]]))
- .set_offset_index(Some(vec![vec![
- OffsetIndexMetaData::try_new(offset_index).unwrap()
- ]]))
+ .set_offset_index(Some(vec![vec![offset_index]]))
.build();
#[cfg(not(feature = "encryption"))]
- let bigger_expected_size = 2704;
+ let bigger_expected_size = 2706;
Review Comment:
I think we can handle 2 more bytes 😆 -- though I wonder where this came from
##########
parquet/src/parquet_macros.rs:
##########
@@ -20,13 +20,22 @@
// They allow for pasting sections of the Parquet thrift IDL file
// into a macro to generate rust structures and implementations.
-// TODO(ets): These macros need a good bit of documentation so other
developers will be able
-// to use them correctly. Also need to write a .md file with complete examples
of both how
-// to use the macros, and how to implement custom readers and writers when
necessary.
+//! This is a collection of macros used to parse Thrift IDL descriptions of
structs,
Review Comment:
reading this description, maybe a better (more discoverable) name for the
module would be `parquet/src/thrift_macros.rs` instead of
`parquet/src/parquet_macros.rs`
##########
parquet/THRIFT.md:
##########
@@ -0,0 +1,446 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Thrift serialization in the parquet crate
+
+For both performance and flexibility reasons, this crate uses custom Thrift
parsers and
+serialization mechanisms. For many of the objects defined by the Parquet
specification macros
+are used to generate the objects as well as the code to serialize them. But in
certain instances
+(performance bottlenecks, additions to the spec, etc.), it becomes necessary
to implement the
+serialization code manually. This document serves to document both the
standard usage of the
+Thrift macros, as well as how to implement custom encoders and decoders.
+
+## Thrift macros
+
+The Parquet specification utilizes Thrift enums, unions, and structs, defined
by an Interface
+Description Language (IDL). This IDL is usually parsed by a Thrift code
generator to produce
+language specific structures and serialization/deserialization code. This
crate, however, uses
+Rust macros to perform the same function. This allows for customizations that
produce more
+performant code, as well as the ability to pick and choose which fields to
process.
+
+### Enums
+
+Thrift enums are the simplest structure, and are logically identical to Rust
enums with unit
+variants. The IDL description will look like
+
+```
+enum Type {
+ BOOLEAN = 0;
+ INT32 = 1;
+ INT64 = 2;
+ INT96 = 3;
+ FLOAT = 4;
+ DOUBLE = 5;
+ BYTE_ARRAY = 6;
+ FIXED_LEN_BYTE_ARRAY = 7;
+}
+```
+
+The `thrift_enum` macro can be used in this instance.
+
+```rust
+thrift_enum!(
+ enum Type {
+ BOOLEAN = 0;
+ INT32 = 1;
+ INT64 = 2;
+ INT96 = 3;
+ FLOAT = 4;
+ DOUBLE = 5;
+ BYTE_ARRAY = 6;
+ FIXED_LEN_BYTE_ARRAY = 7;
+}
+);
+```
+
+which will produce a public Rust enum
+
+```rust
+pub enum Type {
+ BOOLEAN,
+ INT32,
+ INT64,
+ INT96,
+ FLOAT,
+ DOUBLE,
+ BYTE_ARRAY,
+ FIXED_LEN_BYTE_ARRAY,
+}
+```
+
+### Unions
+
+Thrift unions are a special kind of struct in which only a single field is
populated. In this
+regard they are much like Rust enums which can have a mix of unit and tuple
variants. Because of
+this flexibility, specifying unions is a little bit trickier.
+
+Often times a union will be defined for which all the variants are typed with
empty structs. For
+example the `TimeUnit` union used for `LogicalType`s.
+
+```
+struct MilliSeconds {}
+struct MicroSeconds {}
+struct NanoSeconds {}
+union TimeUnit {
+ 1: MilliSeconds MILLIS
+ 2: MicroSeconds MICROS
+ 3: NanoSeconds NANOS
+}
+```
+
+When serialized, these empty structs become a single `0` (to mark the end of
the struct). As an
+optimization, and to allow for a simpler interface, the
`thrift_union_all_empty` macro can be used.
+
+```rust
+thrift_union_all_empty!(
+union TimeUnit {
+ 1: MilliSeconds MILLIS
+ 2: MicroSeconds MICROS
+ 3: NanoSeconds NANOS
+}
+);
+```
+
+This macro will ignore the types specified for each variant, and will produce
the following Rust
+`enum`:
+
+```rust
+pub enum TimeUnit {
+ MILLIS,
+ MICROS,
+ NANOS,
+}
+```
+
+For unions with mixed variant types, some modifications to the IDL are
necessary. Take the
Review Comment:
I found these examples super super helpful
##########
parquet/THRIFT.md:
##########
@@ -0,0 +1,446 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Thrift serialization in the parquet crate
+
+For both performance and flexibility reasons, this crate uses custom Thrift
parsers and
+serialization mechanisms. For many of the objects defined by the Parquet
specification macros
+are used to generate the objects as well as the code to serialize them. But in
certain instances
+(performance bottlenecks, additions to the spec, etc.), it becomes necessary
to implement the
+serialization code manually. This document serves to document both the
standard usage of the
+Thrift macros, as well as how to implement custom encoders and decoders.
+
+## Thrift macros
+
+The Parquet specification utilizes Thrift enums, unions, and structs, defined
by an Interface
+Description Language (IDL). This IDL is usually parsed by a Thrift code
generator to produce
+language specific structures and serialization/deserialization code. This
crate, however, uses
+Rust macros to perform the same function. This allows for customizations that
produce more
+performant code, as well as the ability to pick and choose which fields to
process.
+
+### Enums
+
+Thrift enums are the simplest structure, and are logically identical to Rust
enums with unit
+variants. The IDL description will look like
+
+```
+enum Type {
+ BOOLEAN = 0;
+ INT32 = 1;
+ INT64 = 2;
+ INT96 = 3;
+ FLOAT = 4;
+ DOUBLE = 5;
+ BYTE_ARRAY = 6;
+ FIXED_LEN_BYTE_ARRAY = 7;
+}
+```
+
+The `thrift_enum` macro can be used in this instance.
+
+```rust
Review Comment:
Another benefit of putting this code in the `parquet/thrift/mod.rs` module
is that these examples will be tested as part of CI
##########
parquet/THRIFT.md:
##########
@@ -0,0 +1,446 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Thrift serialization in the parquet crate
Review Comment:
I found this easier to read via the rendered version here:
https://github.com/etseidl/arrow-rs/blob/remove_format/parquet/THRIFT.md
##########
parquet/THRIFT.md:
##########
@@ -0,0 +1,446 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Thrift serialization in the parquet crate
+
+For both performance and flexibility reasons, this crate uses custom Thrift
parsers and
+serialization mechanisms. For many of the objects defined by the Parquet
specification macros
+are used to generate the objects as well as the code to serialize them. But in
certain instances
+(performance bottlenecks, additions to the spec, etc.), it becomes necessary
to implement the
+serialization code manually. This document serves to document both the
standard usage of the
+Thrift macros, as well as how to implement custom encoders and decoders.
+
+## Thrift macros
+
+The Parquet specification utilizes Thrift enums, unions, and structs, defined
by an Interface
+Description Language (IDL). This IDL is usually parsed by a Thrift code
generator to produce
+language specific structures and serialization/deserialization code. This
crate, however, uses
+Rust macros to perform the same function. This allows for customizations that
produce more
+performant code, as well as the ability to pick and choose which fields to
process.
Review Comment:
It also typically avoids a second copy of the structures, which might be
useful to mention
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]