etseidl commented on code in PR #8476:
URL: https://github.com/apache/arrow-rs/pull/8476#discussion_r2388268730


##########
parquet/THRIFT.md:
##########
@@ -0,0 +1,446 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Thrift serialization in the parquet crate
+
+For both performance and flexibility reasons, this crate uses custom Thrift 
parsers and
+serialization mechanisms. For many of the objects defined by the Parquet 
specification, macros
+are used to generate the objects as well as the code to serialize them. But in 
certain instances
+(performance bottlenecks, additions to the spec, etc.), it becomes necessary to 
implement the
+serialization code manually. This document describes both the 
standard usage of the
+Thrift macros, as well as how to implement custom encoders and decoders.
+
+## Thrift macros
+
+The Parquet specification utilizes Thrift enums, unions, and structs, defined 
by an Interface
+Description Language (IDL). This IDL is usually parsed by a Thrift code 
generator to produce
+language specific structures and serialization/deserialization code. This 
crate, however, uses
+Rust macros to perform the same function. This allows for customizations that 
produce more
+performant code, as well as the ability to pick and choose which fields to 
process.
+
+### Enums
+
+Thrift enums are the simplest structure, and are logically identical to Rust 
enums with unit
+variants. The IDL description will look like
+
+```
+enum Type {
+  BOOLEAN = 0;
+  INT32 = 1;
+  INT64 = 2;
+  INT96 = 3;
+  FLOAT = 4;
+  DOUBLE = 5;
+  BYTE_ARRAY = 6;
+  FIXED_LEN_BYTE_ARRAY = 7;
+}
+```
+
+The `thrift_enum` macro can be used in this instance.
+
+```rust
+thrift_enum!(
+    enum Type {
+  BOOLEAN = 0;
+  INT32 = 1;
+  INT64 = 2;
+  INT96 = 3;
+  FLOAT = 4;
+  DOUBLE = 5;
+  BYTE_ARRAY = 6;
+  FIXED_LEN_BYTE_ARRAY = 7;
+}
+);
+```
+
+which will produce a public Rust enum
+
+```rust
+pub enum Type {
+  BOOLEAN,
+  INT32,
+  INT64,
+  INT96,
+  FLOAT,
+  DOUBLE,
+  BYTE_ARRAY,
+  FIXED_LEN_BYTE_ARRAY,
+}
+```
+
+### Unions
+
+Thrift unions are a special kind of struct in which only a single field is 
populated. In this
+regard they are much like Rust enums which can have a mix of unit and tuple 
variants. Because of
+this flexibility, specifying unions is a little bit trickier.
+
+Often a union is defined in which all the variants are typed with 
empty structs. One
+example is the `TimeUnit` union used for `LogicalType`s.
+
+```
+struct MilliSeconds {}
+struct MicroSeconds {}
+struct NanoSeconds {}
+union TimeUnit {
+  1: MilliSeconds MILLIS
+  2: MicroSeconds MICROS
+  3: NanoSeconds NANOS
+}
+```
+
+When serialized, these empty structs become a single `0` (to mark the end of 
the struct). As an
+optimization, and to allow for a simpler interface, the 
`thrift_union_all_empty` macro can be used.
+
+```rust
+thrift_union_all_empty!(
+union TimeUnit {
+  1: MilliSeconds MILLIS
+  2: MicroSeconds MICROS
+  3: NanoSeconds NANOS
+}
+);
+```
+
+This macro will ignore the types specified for each variant, and will produce 
the following Rust
+`enum`:
+
+```rust
+pub enum TimeUnit {
+    MILLIS,
+    MICROS,
+    NANOS,
+}
+```
+
+For unions with mixed variant types, some modifications to the IDL are 
necessary. Take the
+definition of `ColumnCryptoMetaData`.
+
+```
+struct EncryptionWithFooterKey {
+}
+
+struct EncryptionWithColumnKey {
+  /** Column path in schema **/
+  1: required list<string> path_in_schema
+
+  /** Retrieval metadata of column encryption key **/
+  2: optional binary key_metadata
+}
+
+union ColumnCryptoMetaData {
+  1: EncryptionWithFooterKey ENCRYPTION_WITH_FOOTER_KEY
+  2: EncryptionWithColumnKey ENCRYPTION_WITH_COLUMN_KEY
+}
+```
+
+The `ENCRYPTION_WITH_FOOTER_KEY` variant is typed with an empty struct, while
+`ENCRYPTION_WITH_COLUMN_KEY` has the type of a struct with fields. In this 
case, the `thrift_union`
+macro is used.
+
+```rust
+thrift_union!(
+union ColumnCryptoMetaData {
+  1: ENCRYPTION_WITH_FOOTER_KEY
+  2: (EncryptionWithColumnKey) ENCRYPTION_WITH_COLUMN_KEY
+}
+);
+```
+
+Here, the type has been omitted for `ENCRYPTION_WITH_FOOTER_KEY` to indicate 
it should be a unit
+variant, while the type for `ENCRYPTION_WITH_COLUMN_KEY` is enclosed in 
parens. The parens are
+necessary to provide a semantic clue to the macro that the identifier is a 
type. The above will
+produce the Rust enum
+
+```rust
+pub enum ColumnCryptoMetaData {
+    ENCRYPTION_WITH_FOOTER_KEY,
+    ENCRYPTION_WITH_COLUMN_KEY(EncryptionWithColumnKey),
+}
+```
+
+### Structs
+
+The `thrift_struct` macro is used for structs. This macro is a little more 
flexible than the others
+because it allows for the visibility to be specified, and also allows for 
lifetimes to be specified
+for the defined structs as well as their fields. An example of this is the 
`SchemaElement` struct.
+This is defined in this crate as
+
+```rust
+thrift_struct!(
+pub(crate) struct SchemaElement<'a> {
+  1: optional Type type_;
+  2: optional i32 type_length;
+  3: optional Repetition repetition_type;
+  4: required string<'a> name;
+  5: optional i32 num_children;
+  6: optional ConvertedType converted_type;
+  7: optional i32 scale
+  8: optional i32 precision
+  9: optional i32 field_id;
+  10: optional LogicalType logical_type
+}
+);
+```
+
+Here the `string` field `name` is given a lifetime annotation, which is then 
propagated to the
+struct definition. Without this annotation, the resultant field would be a 
`String` type, rather
+than a string slice. The visibility of this struct (and all fields) will be 
`pub(crate)`. The
+resultant Rust struct will be
+
+```rust
+pub(crate) struct SchemaElement<'a> {
+    pub(crate) type_: Option<Type>, // here we've changed the name `type` to `type_` 
to avoid reserved words
+    pub(crate) type_length: Option<i32>,
+    pub(crate) repetition_type: Option<Repetition>,
+    pub(crate) name: &'a str,
+    ...
+}
+```
+
+The lifetime annotations can also be added to list elements, as in
+
+```rust
+thrift_struct!(
+struct FileMetaData<'a> {
+  /** Version of this file **/
+  1: required i32 version
+  2: required list<'a><SchemaElement> schema;
+  3: required i64 num_rows
+  4: required list<'a><RowGroup> row_groups
+  5: optional list<KeyValue> key_value_metadata
+  6: optional string created_by
+  7: optional list<ColumnOrder> column_orders;
+  8: optional EncryptionAlgorithm encryption_algorithm
+  9: optional binary footer_signing_key_metadata
+}
+);
+```
+
+Note that the lifetime annotation precedes the element type specification.
+
+## Serialization traits
+
+Serialization is performed via several Rust traits. On the deserialization side, 
objects implement
+the `ReadThrift` trait. This defines a `read_thrift` function that takes a
+`ThriftCompactInputProtocol` I/O object as an argument. The `read_thrift` 
function performs
+all steps necessary to deserialize the object from the input stream, and is 
usually produced by
+one of the macros mentioned above.
+
+On the serialization side, the `WriteThrift` and `WriteThriftField` traits are 
used in conjunction
+with a `ThriftCompactOutputProtocol` struct. As above, the Thrift macros 
produce the necessary
+implementations needed to perform serialization.
+
+While the macros can be used in most circumstances, sometimes more control is 
needed. The following
+sections provide information on how to provide custom implementations for the 
serialization
+traits.
+
+### ReadThrift Customization
+
+Thrift enums are serialized as a single `i32` value. The process of reading an 
enum is straightforward:
+read the enum discriminant, and then match on the possible values. For 
instance, reading the
+`ConvertedType` enum becomes:
+
+```rust
+impl<'a, R: ThriftCompactInputProtocol<'a>> ReadThrift<'a, R> for 
ConvertedType {
+    fn read_thrift(prot: &mut R) -> Result<Self> {
+        let val = prot.read_i32()?;
+        Ok(match val {
+            0 => Self::UTF8,
+            1 => Self::MAP,
+            2 => Self::MAP_KEY_VALUE,
+            ...
+            21 => Self::INTERVAL,
+            _ => return Err(general_err!("Unexpected ConvertedType {}", val)),

Review Comment:
   Hmm... I'm surprised clippy didn't catch this. I'll have to check the macros.
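
For anyone following the quoted hunk, here is a minimal, self-contained sketch of the discriminant-matching pattern it shows. It is only an illustration under stated assumptions: `decode_converted_type` and the `String` error are hypothetical stand-ins, the crate's `ReadThrift`, `ThriftCompactInputProtocol`, and `general_err!` are not reproduced, and only the variants visible in the hunk are listed.

```rust
/// Trimmed-down stand-in for the crate's `ConvertedType`; only the variants
/// visible in the quoted hunk are listed (the elided ones are left out).
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConvertedType {
    UTF8,
    MAP,
    MAP_KEY_VALUE,
    INTERVAL,
}

/// Hypothetical helper: maps an already-decoded `i32` discriminant to a
/// variant, returning an error for any value that has no matching arm.
pub fn decode_converted_type(val: i32) -> Result<ConvertedType, String> {
    Ok(match val {
        0 => ConvertedType::UTF8,
        1 => ConvertedType::MAP,
        2 => ConvertedType::MAP_KEY_VALUE,
        21 => ConvertedType::INTERVAL,
        // The early `return` is needed here because the `match` sits inside `Ok(...)`.
        _ => return Err(format!("Unexpected ConvertedType {}", val)),
    })
}
```

With these definitions, `decode_converted_type(2)` yields `Ok(ConvertedType::MAP_KEY_VALUE)`, and a value with no arm, such as `99`, yields an `Err`.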



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.