[GitHub] [iceberg] rdblue commented on a change in pull request #3159: Adding documentation for metadata tables

GitBox Wed, 22 Sep 2021 10:55:12 -0700


rdblue commented on a change in pull request #3159:
URL: https://github.com/apache/iceberg/pull/3159#discussion_r714180905




##########
File path: site/docs/metadata.md
##########
@@ -0,0 +1,134 @@
+# Metadata Tables
+
+This page describes the internal metadata tables maintained by Iceberg. Please refer to [definitions page](terms.md)
+for more information on terms and definitions and the [specifications page](spec.md) for more information on Iceberg's
+table specification. Complete metadata table schema can be found on the [Spark Queries page](spark-queries.md#metadata-table-schema).
+
+| Name                                              | Description |
+| --------------------------------------------------| ------------|
+| [`AllDataFilesTable`](#AllDataFilesTable)         | Contains rows representing all of the data files in the table. Each row will contain metadata as well as path information stored by the Iceberg. This differs from the `DataFilesTable` because it contains all files currently referenced by any existing Snapshot from this table rather than just the current one.
+| [`AllEntriesTable`](#AllEntriesTable)             | Contains a table's manifest entries as rows, for both delete and data files. Please note that this table exposes internal details, like files that have been deleted. For a table of the live data files, please use `DataFilesTable`.
+| [`AllManifestsTable`](#AllManifestsTable)         | Contains a table's valid manifest files as rows. A valid manifest file is referenced from any snapshot currently tracked by the table. This table may contain duplicate rows.
+| [`DataFilesTable`](#DataFilesTable)               | Contains a table's data files as rows.
+| [`HistoryTable`](#HistoryTable)                   | Contains a table's history as rows. History is based on the table's snapshot log, which logs each update to the table's current snapshot.
+| [`ManifestEntriesTable`](#ManifestEntriesTable)   | Contains a table's manifest entries as rows, for both delete and data files. Please note that this table exposes internal details, like files that have been deleted. For a table of the live data files, please use `DataFilesTable`.
+| [`ManifestsTable`](#ManifestsTable)               | Contains a table's manifest files as rows.
+| [`PartitionsTable`](#PartitionsTable)             | Contains a table's partitions as rows.
+| [`SnapshotsTable`](#SnapshotsTable)               | Contains a table's known snapshots as rows. This does not include snapshots that have been expired using [`ExpireSnapshots`](https://iceberg.apache.org/javadoc/master/org/apache/iceberg/ExpireSnapshots.html).
+
+
+## Table Schema
+
+### <a id="AllDataFilesTable"></a> 1. `AllDataFilesTable`
+
+| Column name           | Required  | Data type         | Description |
+|-----------------------|-----------|-------------------|-------------|
+| content               |           | int               | Contents of the file: 0=data, 1=position deletes, 2=equality deletes
+| file_path             | ✔️        | string            | Location URI with FS scheme
+| file_format           | ✔️        | string            | File format name: avro, orc, or parquet
+| partition             | ✔️        | `struct<...>`     | Partition data tuple, schema based on the partition spec
+| record_count          | ✔️        | long              | Number of records in the file
+| file_size_in_bytes    | ✔️        | long              | Total file size in bytes
+| column_sizes          |           | `map<int, long>`  | Map of column id to total size on disk
+| value_counts          |           | `map<int, long>`  | Map of column id to total count, including null and NaN
+| null_value_counts     |           | `map<int, long>`  | Map of column id to null value count
+| nan_value_counts      |           | `map<int, long>`  | Map of column id to number of NaN values in the column
+| lower_bounds          |           | `map<int, binary>`| Map of column id to lower bound
+| upper_bounds          |           | `map<int, binary>`| Map of column id to upper bound
+| key_metadata          |           | binary            | Encryption key metadata blob
+| split_offsets         |           | `list<long>`      | Splittable offsets
+| equality_ids          |           | `list<int>`       | Equality comparison field IDs
+| sort_order_id         |           | int               | Sort order ID
+
+### <a id="AllEntriesTable"></a> 2. `AllEntriesTable`

Review comment:
       Anchors are generated automatically, so no need to add them in the markdown.
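
To illustrate the comment above, here is a rough sketch of how a heading anchor could be derived automatically. It assumes the site renderer builds heading IDs roughly the way Python-Markdown's `toc` extension does by default (ASCII-fold, drop punctuation, lowercase, join words with hyphens). That assumption is not confirmed in this thread, and `approx_slug` is a hypothetical helper name, not part of Iceberg or of any site tooling.

```python
import re
import unicodedata


def approx_slug(heading_text: str) -> str:
    """Approximate an auto-generated heading anchor (hypothetical helper).

    Assumption: the renderer slugifies the plain heading text by folding it
    to ASCII, removing punctuation, lowercasing, and replacing runs of
    whitespace or hyphens with a single hyphen.
    """
    # Fold accented characters to their ASCII base where possible.
    ascii_text = (
        unicodedata.normalize("NFKD", heading_text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep only word characters, whitespace, and hyphens.
    cleaned = re.sub(r"[^\w\s-]", "", ascii_text).strip().lower()
    # Collapse whitespace and hyphen runs into one hyphen.
    return re.sub(r"[-\s]+", "-", cleaned)


if __name__ == "__main__":
    # Plain heading text, with the inline HTML anchor and backticks removed.
    for heading in ("Table Schema", "AllDataFilesTable", "2. AllEntriesTable"):
        print(heading, "->", "#" + approx_slug(heading))
```

Under that assumption, a heading written without the explicit `<a id="...">` tag still gets a linkable anchor, though the generated ID would be lowercase, so links in the summary table would need to target the generated form.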




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


