jackye1995 commented on a change in pull request #3482: URL: https://github.com/apache/iceberg/pull/3482#discussion_r754635013
########## File path: site/docs/spark-queries.md ########## @@ -191,57 +192,166 @@ join prod.db.table.snapshots s on h.snapshot_id = s.snapshot_id order by made_current_at ``` -```text -+-------------------------+-----------+----------------+---------------------+----------------------------------+ -| made_current_at | operation | snapshot_id | is_current_ancestor | summary[spark.app.id] | -+-------------------------+-----------+----------------+---------------------+----------------------------------+ + +<div class="markdown-table-container" markdown="block"> +| made_current_at | operation | snapshot_id | is_current_ancestor | summary[spark.app.id] | +| -- | -- | -- | -- | -- | | 2019-02-08 03:29:51.215 | append | 57897183625154 | true | application_1520379288616_155055 | | 2019-02-09 16:24:30.13 | delete | 29641004024753 | false | application_1520379288616_151109 | | 2019-02-09 16:32:47.336 | append | 57897183625154 | true | application_1520379288616_155055 | | 2019-02-08 03:47:55.948 | overwrite | 51792995261850 | true | application_1520379288616_152431 | -+-------------------------+-----------+----------------+---------------------+----------------------------------+ -``` +</div> ### Files -To show a table's data files and each file's metadata, run: +To show a table's current data files and each file's metadata, run: ```sql SELECT * FROM prod.db.table.files ``` -```text -+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+ -| file_path | file_format | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | -+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+ + +<div class="markdown-table-container" markdown="block"> +| file_path | file_format | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | +| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | | s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null | [4] | | s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null | [4] | | s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null | [4] | -+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+ -``` +</div> ### Manifests -To show a table's file manifests and each file's metadata, run: +To show a table's current file manifests and each file's metadata, run: ```sql SELECT * FROM prod.db.table.manifests ``` -```text -+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+ -| path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries | -+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+ + +<div class="markdown-table-container" markdown="block"> +| path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries | +| -- | -- | -- | -- | -- | -- | -- | -- | | s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479 | 0 | 6668963634911763636 | 8 | 0 | 0 | [[false,null,2019-05-13,2019-05-15]] | -+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+ +</div> + +!!! Note + 1. Fields within `partition_summaries` column of the manifests table correspond to `field_summary` structs within [manifest list](./spec.md#manifest-lists), with the following order: + - `contains_null` + - `contains_nan` + - `lower_bound` + - `upper_bound` + 2. `contains_nan` could return null, which indicates that this information is not available from files' metadata. + This usually occurs when reading from V1 table, where `contains_nan` is not populated. + +### Partitions + +To show a table's current partitions + +```sql +SELECT * FROM prod.db.table.partitions ``` -Note: -1. Fields within `partition_summaries` column of the manifests table correspond to `field_summary` structs within [manifest list](./spec.md#manifest-lists), with the following order: - - `contains_null` - - `contains_nan` - - `lower_bound` - - `upper_bound` -2. `contains_nan` could return null, which indicates that this information is not available from files' metadata. - This usually occurs when reading from V1 table, where `contains_nan` is not populated. +<div class="markdown-table-container" markdown="block"> +| partition | record_count | file_count | +| -- | -- | -- | +| {20211001, 11}| 1| 1| +| {20211002, 11}| 1| 1| +| {20211001, 10}| 1| 1| +| {20211002, 10}| 1| 1| +</div> + +### All Metadata + +!!! Note + The metadata is readable from any snapshot currently tracked by the table + +!!! WARNING + The table's metadata may return **duplicate** rows + +#### All Data Files + +To show a table's valid data files and each file's metadata, run: + +```sql +SELECT * FROM prod.db.table.all_data_files +``` + +<div class="markdown-table-container" markdown="block"> +| content | file_path | file_format | partition | record_count | file_size_in_bytes | column_sizes| value_counts | null_value_counts | nan_value_counts| lower_bounds| upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id| +| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +| 0|s3:.../dt=20210102/xxx.parquet| PARQUET|{20210102}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210102}|{1 -> 2, 2 -> 20210102}| null| [4]| null| 0| +| 0|s3:.../dt=20210103/xxx.parquet| PARQUET|{20210103}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210103}|{1 -> 3, 2 -> 20210103}| null| [4]| null| 0| +| 0|s3:.../dt=20210104/xxx.parquet| PARQUET|{20210104}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210104}|{1 -> 3, 2 -> 20210104}| null| [4]| null| 0| +</div> + +#### All Manifests + +To show a table's valid file manifests and each file's metadata, run: + +```sql +SELECT * FROM prod.db.table.all_manifests +``` + +<div class="markdown-table-container" markdown="block"> +| path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count| partition_summaries| +| -- | -- | -- | -- | -- | -- | -- | -- | +| s3:/.../table/data/dt=20210102/xxx.parquet | 6376 | 0 | 6272782676904868561 | 2 | 0 | 0 |[{false, false, 10, 11}, {false, false, 20210101, 20210101}]| +</div> + +!!! Note + 1. A valid manifest file is one that is referenced from any snapshot currently tracked by the table. + 2. Fields within `partition_summaries` column of the manifests table correspond to `field_summary` structs within [manifest list](./spec.md#manifest-lists), with the following order: + - `contains_null` + - `contains_nan` + - `lower_bound` + - `upper_bound` + 3. `contains_nan` could return null, which indicates that this information is not available from files' metadata. + This usually occurs when reading from V1 table, where `contains_nan` is not populated. + +### Advanced Usage + +#### Entries + +To show a table's current manifest entries as rows, for both delete and data files, run: + +```sql +SELECT * FROM prod.db.table.entries +``` + +<div class="markdown-table-container" markdown="block"> +| status|snapshot_id | sequence_number | data_file | +| -- | -- | -- | -- | +|0 |7462238160765527919|0 |{0, s3://.../dt=20211001/age=10/00000-0-38e4b886-b445-40a9-8db0-58653a331aba-00001.parquet, PARQUET, {20211001, 10}, 1, 1132, {1 -> 47, 2 -> 52, 3 -> 47, 4 -> 55}, {1 -> 1, 2 -> 1, 3 -> 1, 4 -> 1}, {1 -> 0, 2 -> 0, 3 -> 0, 4 -> 0}, {}, {1 ->