[GitHub] [iceberg] yyanyy commented on a change in pull request #2762: Update v2 change summary

GitBox Fri, 02 Jul 2021 19:21:32 -0700


yyanyy commented on a change in pull request #2762:
URL: https://github.com/apache/iceberg/pull/2762#discussion_r663300772




##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as 
individual binary valu
 
 ## Format version changes
 
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank)     | The field should be omitted |
+| _optional_  | The field can be written |
+| _required_  | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2 
tables so that tables can be upgraded to v2 without rewriting the metadata 
tree. For manifest list and manifest files, this table shows the expected v2 
read behavior:
+
+| v1         | v2         | v2 read behavior |
+|------------|------------|------------------|
+|            | _optional_ | Read the field as _optional_ |
+|            | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _optional_ |            | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _required_ |            | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the 
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will 
not appear in a v2 table. Required v2 fields that were not present in v1 or 
optional in v1 may be handled as required fields.

Review comment:
       The second sentence may seem a bit unintuitive, does "handled as 
required fields" mean the v2 reader should fail the read if the field that's 
required in v2 but missing/optional in v1 cannot be found in the metadata json 
file?

##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as 
individual binary valu
 
 ## Format version changes
 
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+
+| Requirement | Write behavior |

Review comment:
       wondering if we should put these tables to the beginning of this page, 
as these keywords are used throughout the page, especially on the difference in 
reading requirement between metadata files vs non-metadata files

##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as 
individual binary valu
 
 ## Format version changes
 
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank)     | The field should be omitted |
+| _optional_  | The field can be written |
+| _required_  | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2 
tables so that tables can be upgraded to v2 without rewriting the metadata 
tree. For manifest list and manifest files, this table shows the expected v2 
read behavior:
+
+| v1         | v2         | v2 read behavior |
+|------------|------------|------------------|
+|            | _optional_ | Read the field as _optional_ |
+|            | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _optional_ |            | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _required_ |            | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the 
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will 
not appear in a v2 table. Required v2 fields that were not present in v1 or 
optional in v1 may be handled as required fields.
+
 ### Version 2
 
 Writing v1 metadata:
 
-* Table metadata field `last-sequence-number` should not be written.
-* Snapshot field `sequence-number` should not be written.
+* Table metadata field `last-sequence-number` should not be written
+* Snapshot field `sequence-number` should not be written
+* Manifest entry field `sequence_number` should not be written
 
-Reading v1 metadata:
+Reading v1 metadata for v2:
 
-* Table metadata field `last-sequence-number` must default to 0.
-* Snapshot field `sequence-number` must default to 0.
+* Table metadata field `last-sequence-number` must default to 0
+* Snapshot field `sequence-number` must default to 0
+* Manifest list field `content` must default to 0 (data)
+* Data file field `content` must default to 0 (data)
 
 Writing v2 metadata:
 
-* Table metadata added required field `last-sequence-number`.
-* Table metadata now requires field `table-uuid`.
-* Table metadata now requires field `partition-specs`.
-* Table metadata now requires field `default-spec-id`.
-* Table metadata now requires field `last-partition-id`.
-* Table metadata field `partition-spec` is no longer required and may be 
omitted.
-* Snapshot added required field `sequence-number`.
-* Snapshot now requires field `manifest-list`.
-* Snapshot field `manifests` is no longer allowed.
-* Table metadata now requires field `sort-orders`.
-* Table metadata now requires field `default-sort-order-id`.
+* Table metadata JSON:
+    * `last-sequence-number` was added and is required; default to 0 when 
reading v1 metadata
+    * `table-uuid` is now required
+    * `current-schema-id` is now required
+    * `schemas` is now required
+    * `partition-specs` is now required
+    * `default-spec-id` is now required
+    * `last-partition-id` is now required
+    * `sort-orders` is now required
+    * `default-sort-order-id` is now required

Review comment:
       Unrelated to this specific PR, but just wondering; if a table does not 
have sort order defined, in `sort-orders` it will have an `unsorted` with id=0? 
Is this mostly to prevent having a `sort-orders` but without default id or vice 
versa? 

##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as 
individual binary valu
 
 ## Format version changes
 
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank)     | The field should be omitted |
+| _optional_  | The field can be written |
+| _required_  | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2 
tables so that tables can be upgraded to v2 without rewriting the metadata 
tree. For manifest list and manifest files, this table shows the expected v2 
read behavior:
+
+| v1         | v2         | v2 read behavior |
+|------------|------------|------------------|
+|            | _optional_ | Read the field as _optional_ |
+|            | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _optional_ |            | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _required_ |            | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the 
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will 
not appear in a v2 table. Required v2 fields that were not present in v1 or 
optional in v1 may be handled as required fields.
+
 ### Version 2
 
 Writing v1 metadata:
 
-* Table metadata field `last-sequence-number` should not be written.
-* Snapshot field `sequence-number` should not be written.
+* Table metadata field `last-sequence-number` should not be written
+* Snapshot field `sequence-number` should not be written
+* Manifest entry field `sequence_number` should not be written
 
-Reading v1 metadata:
+Reading v1 metadata for v2:
 
-* Table metadata field `last-sequence-number` must default to 0.
-* Snapshot field `sequence-number` must default to 0.
+* Table metadata field `last-sequence-number` must default to 0
+* Snapshot field `sequence-number` must default to 0
+* Manifest list field `content` must default to 0 (data)
+* Data file field `content` must default to 0 (data)
 
 Writing v2 metadata:
 
-* Table metadata added required field `last-sequence-number`.
-* Table metadata now requires field `table-uuid`.
-* Table metadata now requires field `partition-specs`.
-* Table metadata now requires field `default-spec-id`.
-* Table metadata now requires field `last-partition-id`.
-* Table metadata field `partition-spec` is no longer required and may be 
omitted.
-* Snapshot added required field `sequence-number`.
-* Snapshot now requires field `manifest-list`.
-* Snapshot field `manifests` is no longer allowed.
-* Table metadata now requires field `sort-orders`.
-* Table metadata now requires field `default-sort-order-id`.
+* Table metadata JSON:
+    * `last-sequence-number` was added and is required; default to 0 when 
reading v1 metadata

Review comment:
       nit: "default to 0 when reading v1 metadata" seems to already be 
mentioned in earlier section? 

##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as 
individual binary valu
 
 ## Format version changes
 
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank)     | The field should be omitted |
+| _optional_  | The field can be written |
+| _required_  | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2 
tables so that tables can be upgraded to v2 without rewriting the metadata 
tree. For manifest list and manifest files, this table shows the expected v2 
read behavior:
+
+| v1         | v2         | v2 read behavior |
+|------------|------------|------------------|
+|            | _optional_ | Read the field as _optional_ |
+|            | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _optional_ |            | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _required_ |            | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the 
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will 
not appear in a v2 table. Required v2 fields that were not present in v1 or 
optional in v1 may be handled as required fields.
+
 ### Version 2
 
 Writing v1 metadata:
 
-* Table metadata field `last-sequence-number` should not be written.
-* Snapshot field `sequence-number` should not be written.
+* Table metadata field `last-sequence-number` should not be written
+* Snapshot field `sequence-number` should not be written
+* Manifest entry field `sequence_number` should not be written

Review comment:
       Looks like this might be to avoid v2 table writing these data to some of 
the v1 files, which got read by v2 reader and mess up with the apply order for 
delete files for later? I wonder if we want to mention the 
`sequence_number`/`min_sequence_number` in manifest_list here,  since v2 
writers will mostly likely be creating manifest list file for v1? 

##########
File path: site/docs/spec.md
##########
@@ -979,30 +979,88 @@ This serialization scheme is for storing single values as 
individual binary valu
 
 ## Format version changes
 
+Some tables in this spec have columns that specify requirements for v1 and v2 
tables. These requirements are intended for writers when adding metadata files 
to a table with the given version.
+
+| Requirement | Write behavior |
+|-------------|----------------|
+| (blank)     | The field should be omitted |
+| _optional_  | The field can be written |
+| _required_  | The field must be written |
+
+Readers should be more permissive because v1 metadata files are allowed in v2 
tables so that tables can be upgraded to v2 without rewriting the metadata 
tree. For manifest list and manifest files, this table shows the expected v2 
read behavior:
+
+| v1         | v2         | v2 read behavior |
+|------------|------------|------------------|
+|            | _optional_ | Read the field as _optional_ |
+|            | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _optional_ |            | Ignore the field |
+| _optional_ | _optional_ | Read the field as _optional_ |
+| _optional_ | _required_ | Read the field as _optional_; it may be missing in 
v1 files |
+| _required_ |            | Ignore the field |
+| _required_ | _optional_ | Read the field as _optional_ |
+| _required_ | _required_ | Fill in a default or throw an exception if the 
field is missing |
+
+Readers may be more strict for metadata JSON files because a v1 JSON file will 
not appear in a v2 table. Required v2 fields that were not present in v1 or 
optional in v1 may be handled as required fields.
+
 ### Version 2
 
 Writing v1 metadata:
 
-* Table metadata field `last-sequence-number` should not be written.
-* Snapshot field `sequence-number` should not be written.
+* Table metadata field `last-sequence-number` should not be written
+* Snapshot field `sequence-number` should not be written
+* Manifest entry field `sequence_number` should not be written
 
-Reading v1 metadata:
+Reading v1 metadata for v2:
 
-* Table metadata field `last-sequence-number` must default to 0.
-* Snapshot field `sequence-number` must default to 0.
+* Table metadata field `last-sequence-number` must default to 0

Review comment:
       Would we want to default every sequence number to 0? e.g. in manifest 
entry and manifest list




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] yyanyy commented on a change in pull request #2762: Update v2 change summary

Reply via email to