yyanyy commented on a change in pull request #1141:
URL: https://github.com/apache/iceberg/pull/1141#discussion_r556185977
##########
File path: site/docs/spec.md
##########
@@ -218,10 +218,31 @@ Partition specs capture the transform from table data to
partition values. This
| **`month`** | Extract a date or timestamp month, as months from
1970-01-01 | `date`, `timestamp(tz)`
| `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01
| `date`, `timestamp(tz)`
| `date` |
| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01
00:00:00 | `timestamp(tz)`
| `int` |
+| **`void`** | Always produces `null` (the void transform)
| Any
| Source type |
All transforms must return `null` for a `null` input value.
+#### Partition Field ID handling
+
+A partition field ID is an integer used to identify a partition field in
Iceberg manifest files.
+Field IDs are required in v2 and optional in v1.
+
+About compatibility between v1 and v2 tables:
+
+* For backward compatibility, if field ids are missing in a table metadata,
the reference implementation will sequentially generate ids for each field
starting at `1000` based on its position in the list of fields.
+* For forward compatibility, if field ids are not supported but present in the
metadata, old versions of the reference implementation will ignore those field
ids and then regenerate an auto-increment field id starting at 1000 for every
partition field.
+
+While working with a v1 table, the v1 partition spec does not require
consistent field IDs and then they are assigned when creating each manifest
file.
+When creating a manifest, each field of the partition spec will be assigned an
ID starting at `1000`, and there is no guarantees about ID reuse across files.
+But as long as the partition spec will not be evolved, IDs will be consistent.
+
+This has a few implications:
+* Older writers may erase partition field IDs when writing to a v1 table. This
does not happen to v2 tables because writers will fail to read or write a v2
table.
Review comment:
Nit: "... will fail to read or write a v2 table when partition field ID
doesn't exist"?
##########
File path: site/docs/spec.md
##########
@@ -218,10 +218,31 @@ Partition specs capture the transform from table data to
partition values. This
| **`month`** | Extract a date or timestamp month, as months from
1970-01-01 | `date`, `timestamp(tz)`
| `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01
| `date`, `timestamp(tz)`
| `date` |
| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01
00:00:00 | `timestamp(tz)`
| `int` |
+| **`void`** | Always produces `null` (the void transform)
| Any
| Source type |
All transforms must return `null` for a `null` input value.
+#### Partition Field ID handling
+
+A partition field ID is an integer used to identify a partition field in
Iceberg manifest files.
+Field IDs are required in v2 and optional in v1.
+
+About compatibility between v1 and v2 tables:
+
+* For backward compatibility, if field ids are missing in a table metadata,
the reference implementation will sequentially generate ids for each field
starting at `1000` based on its position in the list of fields.
+* For forward compatibility, if field ids are not supported but present in the
metadata, old versions of the reference implementation will ignore those field
ids and then regenerate an auto-increment field id starting at 1000 for every
partition field.
+
+While working with a v1 table, the v1 partition spec does not require
consistent field IDs and then they are assigned when creating each manifest
file.
+When creating a manifest, each field of the partition spec will be assigned an
ID starting at `1000`, and there is no guarantees about ID reuse across files.
+But as long as the partition spec will not be evolved, IDs will be consistent.
+
+This has a few implications:
+* Older writers may erase partition field IDs when writing to a v1 table. This
does not happen to v2 tables because writers will fail to read or write a v2
table.
+* Metadata tables need consistent field IDs across manifest files. To achieve
it, for v1 tables, please evolve the partition spec according to the
recommendations,
+i.e. don't reorder or delete partition fields; replace fields with with `void`
transform; add new fields to the end. Note that renames are OK and also note
that this does not apply for v2 tables.
Review comment:
Nit: might want to clarify why it doesn't apply for v2 tables; it seems
that it's not needed because v2 table is capable of auto identifying the latest
partition field Id across manifests to always assign a new one to the new field
during a partition spec update table commit? If that's the case, it might be
good to mention it here
##########
File path: site/docs/spec.md
##########
@@ -218,10 +218,31 @@ Partition specs capture the transform from table data to
partition values. This
| **`month`** | Extract a date or timestamp month, as months from
1970-01-01 | `date`, `timestamp(tz)`
| `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01
| `date`, `timestamp(tz)`
| `date` |
| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01
00:00:00 | `timestamp(tz)`
| `int` |
+| **`void`** | Always produces `null` (the void transform)
| Any
| Source type |
Review comment:
Nit: might be good to have an explanation here on why this is needed
(not likely needed in v2, mostly needed in v1 when dropping a partition field
without changing the auto-assigned partition field ids?)
##########
File path: site/docs/spec.md
##########
@@ -218,10 +218,31 @@ Partition specs capture the transform from table data to
partition values. This
| **`month`** | Extract a date or timestamp month, as months from
1970-01-01 | `date`, `timestamp(tz)`
| `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01
| `date`, `timestamp(tz)`
| `date` |
| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01
00:00:00 | `timestamp(tz)`
| `int` |
+| **`void`** | Always produces `null` (the void transform)
| Any
| Source type |
All transforms must return `null` for a `null` input value.
+#### Partition Field ID handling
+
+A partition field ID is an integer used to identify a partition field in
Iceberg manifest files.
+Field IDs are required in v2 and optional in v1.
+
+About compatibility between v1 and v2 tables:
+
+* For backward compatibility, if field ids are missing in a table metadata,
the reference implementation will sequentially generate ids for each field
starting at `1000` based on its position in the list of fields.
+* For forward compatibility, if field ids are not supported but present in the
metadata, old versions of the reference implementation will ignore those field
ids and then regenerate an auto-increment field id starting at 1000 for every
partition field.
+
+While working with a v1 table, the v1 partition spec does not require
consistent field IDs and then they are assigned when creating each manifest
file.
+When creating a manifest, each field of the partition spec will be assigned an
ID starting at `1000`, and there is no guarantees about ID reuse across files.
+But as long as the partition spec will not be evolved, IDs will be consistent.
+
+This has a few implications:
+* Older writers may erase partition field IDs when writing to a v1 table. This
does not happen to v2 tables because writers will fail to read or write a v2
table.
+* Metadata tables need consistent field IDs across manifest files. To achieve
it, for v1 tables, please evolve the partition spec according to the
recommendations,
Review comment:
Nit: I wonder if we want to expand a little bit on "Metadata tables need
consistent field IDs across manifest files", which I think is to ensure
correctness when looking up metadata tables, the same ID always refers to the
same partition field/no ID reuses, and this mapping between ID and partition
field is immutable even after partition spec evolution.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]