This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/master by this push:
new db239e5b3 Add (more) Parquet Metadata Documentation (#6184)
db239e5b3 is described below
commit db239e5b3aa05985b0149187c8b93b88e2285b48
Author: Andrew Lamb <[email protected]>
AuthorDate: Tue Aug 6 17:13:15 2024 -0400
Add (more) Parquet Metadata Documentation (#6184)
* Minor: Add (more) Parquet Metadata Documenation
* fix clippy
---
parquet/src/file/metadata/mod.rs | 61 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/parquet/src/file/metadata/mod.rs b/parquet/src/file/metadata/mod.rs
index 86c673bbd..45ef0c546 100644
--- a/parquet/src/file/metadata/mod.rs
+++ b/parquet/src/file/metadata/mod.rs
@@ -33,6 +33,67 @@
//! * [`ColumnChunkMetaData`]: Metadata for each column chunk (primitive leaf)
//! within a Row Group including encoding and compression information,
//! number of values, statistics, etc.
+//!
+//! # APIs for working with Parquet Metadata
+//!
+//! The Parquet readers and writers in this crate read and write
+//! metadata into parquet files. To work with metadata directly,
+//! the following APIs are available.
+//!
+//! Reading:
+//! * Read from bytes to `ParquetMetaData`: [`decode_footer`]
+//! and [`decode_metadata`]
+//! * Read from an `async` source to `ParquetMetadata`: [`MetadataLoader`]
+//!
+//! [`MetadataLoader`]:
https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html
+//! [`decode_footer`]: crate::file::footer::decode_footer
+//! [`decode_metadata`]: crate::file::footer::decode_metadata
+//!
+//! Writing:
+//! * Write `ParquetMetaData` to bytes in memory: Not yet supported (see
[#6002])
+//! * Writes `ParquetMetaData` to an async target: Not yet supported
+//!
+//! [#6002]: https://github.com/apache/arrow-rs/issues/6002
+//!
+//! # Metadata Encodings and Structures
+//!
+//! There are three different encodings of Parquet Metadata in this crate:
+//!
+//! 1. `bytes`:encoded with the Thrift TCompactProtocol as defined in
+//! [parquet.thrift]
+//!
+//! 2. [`format`]: Rust structures automatically generated by the thrift
compiler
+//! from [parquet.thrift]. These structures are low level and mirror
+//! the thrift definitions.
+//!
+//! 3. [`file::metadata`] (this module): Easier to use Rust structures
+//! with a more idiomatic API. Note that, confusingly, some but not all
+//! of these structures have the same name as the [`format`] structures.
+//!
+//! [`format`]: crate::format
+//! [`file::metadata`]: crate::file::metadata
+//! [parquet.thrift]:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
+//!
+//! Graphically, this is how the different structures relate to each other:
+//!
+//! ```text
+//! ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┌ ─ ─ ─ ─ ─ ─ ─ ─
─ ─ ─ ─ ─
+//! ┌──────────────┐ │
┌───────────────────────┐ │
+//! │ │ ColumnIndex │ ││
ParquetMetaData │
+//! └──────────────┘ │
└───────────────────────┘ │
+//! ┌──────────────┐ │ ┌────────────────┐
│┌───────────────────────┐
+//! │ ..0x24.. │ ◀────▶ │ OffsetIndex │ │ ◀────▶ │
ParquetMetaData │ │
+//! └──────────────┘ │ └────────────────┘
│└───────────────────────┘
+//! ... │ ...
│
+//! │ ┌──────────────────┐ │
┌──────────────────┐
+//! bytes │ FileMetaData* │ │ │
FileMetaData* │ │
+//! (thrift encoded) │ └──────────────────┘ │
└──────────────────┘
+//! ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ ─ ─ ─ ─ ─ ─ ─ ─
─ ─ ─ ─ ─ ┘
+//!
+//! format::meta structures file::metadata
structures
+//!
+//! * Same name, different struct
+//! ```
mod memory;
use std::ops::Range;