This is an automated email from the ASF dual-hosted git repository.
yyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/master by this push:
new e8ca53b Core: fix NPE in manifests table for contains_nan column, update spec (#2521)
e8ca53b is described below
commit e8ca53b9794245fc89ce87cbf6630679c1301298
Author: yyanyy <[email protected]>
AuthorDate: Wed Apr 28 14:32:59 2021 -0700
Core: fix NPE in manifests table for contains_nan column, update spec (#2521)
---
.../java/org/apache/iceberg/ManifestsTable.java | 2 +-
site/docs/spark-queries.md | 33 ++++++++++++++--------
site/docs/spec.md | 3 +-
3 files changed, 24 insertions(+), 14 deletions(-)
diff --git a/core/src/main/java/org/apache/iceberg/ManifestsTable.java b/core/src/main/java/org/apache/iceberg/ManifestsTable.java
index e7b9222..c75daf6 100644
--- a/core/src/main/java/org/apache/iceberg/ManifestsTable.java
+++ b/core/src/main/java/org/apache/iceberg/ManifestsTable.java
@@ -38,7 +38,7 @@ public class ManifestsTable extends BaseMetadataTable {
       Types.NestedField.required(7, "deleted_data_files_count", Types.IntegerType.get()),
       Types.NestedField.required(8, "partition_summaries", Types.ListType.ofRequired(9, Types.StructType.of(
           Types.NestedField.required(10, "contains_null", Types.BooleanType.get()),
-          Types.NestedField.required(11, "contains_nan", Types.BooleanType.get()),
+          Types.NestedField.optional(11, "contains_nan", Types.BooleanType.get()),
           Types.NestedField.optional(12, "lower_bound", Types.StringType.get()),
           Types.NestedField.optional(13, "upper_bound", Types.StringType.get())
       )))
diff --git a/site/docs/spark-queries.md b/site/docs/spark-queries.md
index f7a78b5..687606f 100644
--- a/site/docs/spark-queries.md
+++ b/site/docs/spark-queries.md
@@ -210,13 +210,13 @@ To show a table's data files and each file's metadata, run:
SELECT * FROM prod.db.table.files
```
```text
-+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
-| file_path                                                               | file_format | record_count | file_size_in_bytes | column_sizes       | value_counts     | null_value_counts | lower_bounds    | upper_bounds    | key_metadata | split_offsets |
-+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
-| s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null         | [4]           |
-| s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null         | [4]           |
-| s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null         | [4]           |
-+-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+-----------------+-----------------+--------------+---------------+
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
+| file_path                                                               | file_format | record_count | file_size_in_bytes | column_sizes       | value_counts     | null_value_counts | nan_value_counts | lower_bounds    | upper_bounds    | key_metadata | split_offsets |
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
+| s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null         | [4]           |
+| s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null         | [4]           |
+| s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET     | 1            | 597                | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0]  | []               | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null         | [4]           |
++-------------------------------------------------------------------------+-------------+--------------+--------------------+--------------------+------------------+-------------------+------------------+-----------------+-----------------+--------------+---------------+
```
### Manifests
@@ -227,13 +227,22 @@ To show a table's file manifests and each file's metadata, run:
SELECT * FROM prod.db.table.manifests
```
```text
-+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
-| path                                                                 | length | partition_spec_id | added_snapshot_id   | added_data_files_count | existing_data_files_count | deleted_data_files_count | partitions                      |
-+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
-| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479   | 0                 | 6668963634911763636 | 8                      | 0                         | 0                        | [[false,2019-05-13,2019-05-15]] |
-+----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+---------------------------------+
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
+| path                                                                 | length | partition_spec_id | added_snapshot_id   | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries                  |
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
+| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479   | 0                 | 6668963634911763636 | 8                      | 0                         | 0                        | [[false,null,2019-05-13,2019-05-15]] |
++----------------------------------------------------------------------+--------+-------------------+---------------------+------------------------+---------------------------+--------------------------+--------------------------------------+
```
+Note:
+1. Fields within `partition_summaries` column of the manifests table correspond to `field_summary` structs within [manifest list](./spec.md#manifest-lists), with the following order:
+ - `contains_null`
+ - `contains_nan`
+ - `lower_bound`
+ - `upper_bound`
+2. `contains_nan` could return null, which indicates that this information is not available from files' metadata.
+ This usually occurs when reading from V1 table, where `contains_nan` is not populated.
+
## Inspecting with DataFrames
Metadata tables can be loaded in Spark 2.4 or Spark 3 using the DataFrameReader API:
diff --git a/site/docs/spec.md b/site/docs/spec.md
index 6bcfd37..fa16261 100644
--- a/site/docs/spec.md
+++ b/site/docs/spec.md
@@ -431,6 +431,7 @@ Manifest list files store `manifest_file`, a struct with the following fields:
| v1 | v2 | Field id, name | Type | Description |
| ---------- | ---------- |-------------------------|---------------|-------------|
| _required_ | _required_ | **`509 contains_null`** | `boolean` | Whether the manifest contains at least one partition with a null value for the field |
+| _optional_ | _required_ | **`518 contains_nan`** | `boolean` | Whether the manifest contains at least one partition with a NaN value for the field |
| _optional_ | _optional_ | **`510 lower_bound`** | `bytes` [1] | Lower bound for the non-null, non-NaN values in the partition field, or null if all values are null or NaN [2] |
| _optional_ | _optional_ | **`511 upper_bound`** | `bytes` [1] | Upper bound for the non-null, non-NaN values in the partition field, or null if all values are null or NaN [2] |
@@ -952,7 +953,7 @@ Writing v2 metadata:
* Table metadata now requires field `default-spec-id`.
* Table metadata now requires field `last-partition-id`.
* Table metadata field `partition-spec` is no longer required and may be omitted.
-* Snapshot added required field field `sequence-number`.
+* Snapshot added required field `sequence-number`.
* Snapshot now requires field `manifest-list`.
* Snapshot field `manifests` is no longer allowed.
* Table metadata now requires field `sort-orders`.
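
Reader-side illustration (not part of the commit): the note added to spark-queries.md above says `contains_nan` in `partition_summaries` may be null, typically on V1 tables. The following is a minimal Java sketch of how a consumer might treat that three-state value. The class `PartitionSummaryDemo`, the nested `FieldSummary` holder, and both helper methods are hypothetical names made up for this example; they are not Iceberg APIs.

```java
public class PartitionSummaryDemo {

    // Hypothetical stand-in for one entry of the partition_summaries column.
    // Field order mirrors the note: contains_null, contains_nan, lower_bound, upper_bound.
    static final class FieldSummary {
        final boolean containsNull;
        final Boolean containsNan;   // boxed on purpose: null means "not recorded"
        final String lowerBound;
        final String upperBound;

        FieldSummary(boolean containsNull, Boolean containsNan, String lowerBound, String upperBound) {
            this.containsNull = containsNull;
            this.containsNan = containsNan;
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
        }
    }

    // Render the three possible states without unboxing a null Boolean.
    public static String describeNan(Boolean containsNan) {
        if (containsNan == null) {
            return "unknown";
        }
        return containsNan ? "yes" : "no";
    }

    // Conservative reading: when the flag is missing, assume NaN may be present.
    public static boolean mightContainNan(Boolean containsNan) {
        return containsNan == null || containsNan;
    }

    public static void main(String[] args) {
        // Values taken from the sample row [[false,null,2019-05-13,2019-05-15]].
        FieldSummary summary = new FieldSummary(false, null, "2019-05-13", "2019-05-15");
        System.out.println(describeNan(summary.containsNan));      // prints "unknown"
        System.out.println(mightContainNan(summary.containsNan));  // prints "true"
    }
}
```

Assigning a null `Boolean` to a primitive `boolean` throws a `NullPointerException` in Java, so the sketch keeps the boxed type and checks for null before using the value.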