[GitHub] [iceberg] yyanyy commented on a change in pull request #1141: Document field-id requirements for partition specs

GitBox Tue, 12 Jan 2021 17:02:45 -0800


yyanyy commented on a change in pull request #1141:
URL: https://github.com/apache/iceberg/pull/1141#discussion_r556185977




##########
File path: site/docs/spec.md
##########
@@ -218,10 +218,31 @@ Partition specs capture the transform from table data to 
partition values. This
 | **`month`**       | Extract a date or timestamp month, as months from 
1970-01-01 | `date`, `timestamp(tz)`                                            
                                       | `int`       |
 | **`day`**         | Extract a date or timestamp day, as days from 1970-01-01 
    | `date`, `timestamp(tz)`                                                   
                                | `date`      |
 | **`hour`**        | Extract a timestamp hour, as hours from 1970-01-01 
00:00:00  | `timestamp(tz)`                                                     
                                      | `int`       |
+| **`void`**        | Always produces `null` (the void transform)              
    | Any                                                                       
                                | Source type |
 
 All transforms must return `null` for a `null` input value.
 
 
+#### Partition Field ID handling
+
+A partition field ID is an integer used to identify a partition field in 
Iceberg manifest files. 
+Field IDs are required in v2 and optional in v1.
+
+About compatibility between v1 and v2 tables:
+
+* For backward compatibility, if field ids are missing in a table metadata, 
the reference implementation will sequentially generate ids for each field 
starting at `1000` based on its position in the list of fields.
+* For forward compatibility, if field ids are not supported but present in the 
metadata, old versions of the reference implementation will ignore those field 
ids and then regenerate an auto-increment field id starting at 1000 for every 
partition field.
+
+While working with a v1 table, the v1 partition spec does not require 
consistent field IDs and then they are assigned when creating each manifest 
file. 
+When creating a manifest, each field of the partition spec will be assigned an 
ID starting at `1000`, and there is no guarantees about ID reuse across files. 
+But as long as the partition spec will not be evolved, IDs will be consistent.
+
+This has a few implications:
+* Older writers may erase partition field IDs when writing to a v1 table. This 
does not happen to v2 tables because writers will fail to read or write a v2 
table.

Review comment:
       Nit: maybe "... will fail to read or write a v2 table when the partition 
field ID doesn't exist"? 

##########
File path: site/docs/spec.md
##########
@@ -218,10 +218,31 @@ Partition specs capture the transform from table data to 
partition values. This
 | **`month`**       | Extract a date or timestamp month, as months from 
1970-01-01 | `date`, `timestamp(tz)`                                            
                                       | `int`       |
 | **`day`**         | Extract a date or timestamp day, as days from 1970-01-01 
    | `date`, `timestamp(tz)`                                                   
                                | `date`      |
 | **`hour`**        | Extract a timestamp hour, as hours from 1970-01-01 
00:00:00  | `timestamp(tz)`                                                     
                                      | `int`       |
+| **`void`**        | Always produces `null` (the void transform)              
    | Any                                                                       
                                | Source type |
 
 All transforms must return `null` for a `null` input value.
 
 
+#### Partition Field ID handling
+
+A partition field ID is an integer used to identify a partition field in 
Iceberg manifest files. 
+Field IDs are required in v2 and optional in v1.
+
+About compatibility between v1 and v2 tables:
+
+* For backward compatibility, if field ids are missing in a table metadata, 
the reference implementation will sequentially generate ids for each field 
starting at `1000` based on its position in the list of fields.
+* For forward compatibility, if field ids are not supported but present in the 
metadata, old versions of the reference implementation will ignore those field 
ids and then regenerate an auto-increment field id starting at 1000 for every 
partition field.
+
+While working with a v1 table, the v1 partition spec does not require 
consistent field IDs and then they are assigned when creating each manifest 
file. 
+When creating a manifest, each field of the partition spec will be assigned an 
ID starting at `1000`, and there is no guarantees about ID reuse across files. 
+But as long as the partition spec will not be evolved, IDs will be consistent.
+
+This has a few implications:
+* Older writers may erase partition field IDs when writing to a v1 table. This 
does not happen to v2 tables because writers will fail to read or write a v2 
table.
+* Metadata tables need consistent field IDs across manifest files. To achieve 
it, for v1 tables, please evolve the partition spec according to the 
recommendations, 
+i.e. don't reorder or delete partition fields; replace fields with with `void` 
transform; add new fields to the end. Note that renames are OK and also note 
that this does not apply for v2 tables.

Review comment:
       Nit: it might help to clarify why this doesn't apply to v2 tables. It 
seems it isn't needed because a v2 table can track the latest partition field 
ID across manifests and always assign a fresh one to a new field when a 
partition spec update is committed? If that's the case, it would be good to 
mention it here.
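
To make the question concrete, here is a minimal sketch of the behavior this comment guesses at: IDs are handed out past the highest one ever assigned, so a new field never takes over the ID of a dropped one. `PartitionIdAllocator` and its members are hypothetical names for illustration, not the Iceberg API, and whether v2 actually works this way is exactly what the comment is asking.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionIdAllocator:
    """Hypothetical allocator: hands out IDs past the highest one ever used."""

    last_assigned_id: int = 999  # assumption: so the first ID handed out is 1000
    ids_by_field: dict = field(default_factory=dict)

    def id_for(self, source_column: str, transform: str) -> int:
        # A field is identified by (source column, transform); a field that was
        # seen before keeps its ID, and an unseen field gets a fresh one.
        key = (source_column, transform)
        if key not in self.ids_by_field:
            self.last_assigned_id += 1
            self.ids_by_field[key] = self.last_assigned_id
        return self.ids_by_field[key]


allocator = PartitionIdAllocator()

# Initial spec: day(ts), identity(region).
spec_0 = [allocator.id_for("ts", "day"), allocator.id_for("region", "identity")]

# Evolved spec: "region" is dropped and a bucket on "id" is added.
spec_1 = [allocator.id_for("ts", "day"), allocator.id_for("id", "bucket[16]")]
```

In this sketch the ID of the dropped `region` field is never reused, so no reordering or `void` placeholder is needed to keep IDs stable.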

##########
File path: site/docs/spec.md
##########
@@ -218,10 +218,31 @@ Partition specs capture the transform from table data to 
partition values. This
 | **`month`**       | Extract a date or timestamp month, as months from 
1970-01-01 | `date`, `timestamp(tz)`                                            
                                       | `int`       |
 | **`day`**         | Extract a date or timestamp day, as days from 1970-01-01 
    | `date`, `timestamp(tz)`                                                   
                                | `date`      |
 | **`hour`**        | Extract a timestamp hour, as hours from 1970-01-01 
00:00:00  | `timestamp(tz)`                                                     
                                      | `int`       |
+| **`void`**        | Always produces `null` (the void transform)              
    | Any                                                                       
                                | Source type |

Review comment:
       Nit: might be good to have an explanation here on why this is needed 
(not likely needed in v2, mostly needed in v1 when dropping a partition field 
without changing the auto-assigned partition field ids?)
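
A small sketch of the v1 case this comment has in mind, assuming IDs are derived purely from list position starting at `1000` as the diff describes. The function and field names are hypothetical and only model the idea; they are not the reference implementation.

```python
# Hypothetical model of v1 positional ID assignment; not the Iceberg API.
V1_FIRST_FIELD_ID = 1000  # starting value quoted in the diff above


def assign_positional_ids(fields):
    """Map each (name, transform) pair to an ID derived from its list position."""
    return {name: V1_FIRST_FIELD_ID + i for i, (name, _transform) in enumerate(fields)}


original = [("ts_day", "day"), ("region", "identity"), ("id_bucket", "bucket[16]")]

# Dropping "region" outright shifts every later field down by one position.
dropped = [("ts_day", "day"), ("id_bucket", "bucket[16]")]

# Replacing its transform with void keeps the slot, so later IDs do not move.
voided = [("ts_day", "day"), ("region", "void"), ("id_bucket", "bucket[16]")]

ids_original = assign_positional_ids(original)
ids_dropped = assign_positional_ids(dropped)
ids_voided = assign_positional_ids(voided)
```

Under these assumptions `id_bucket` moves from one ID to another when `region` is deleted, while the `void` replacement leaves every ID where it was.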

##########
File path: site/docs/spec.md
##########
@@ -218,10 +218,31 @@ Partition specs capture the transform from table data to 
partition values. This
 | **`month`**       | Extract a date or timestamp month, as months from 
1970-01-01 | `date`, `timestamp(tz)`                                            
                                       | `int`       |
 | **`day`**         | Extract a date or timestamp day, as days from 1970-01-01 
    | `date`, `timestamp(tz)`                                                   
                                | `date`      |
 | **`hour`**        | Extract a timestamp hour, as hours from 1970-01-01 
00:00:00  | `timestamp(tz)`                                                     
                                      | `int`       |
+| **`void`**        | Always produces `null` (the void transform)              
    | Any                                                                       
                                | Source type |
 
 All transforms must return `null` for a `null` input value.
 
 
+#### Partition Field ID handling
+
+A partition field ID is an integer used to identify a partition field in 
Iceberg manifest files. 
+Field IDs are required in v2 and optional in v1.
+
+About compatibility between v1 and v2 tables:
+
+* For backward compatibility, if field ids are missing in a table metadata, 
the reference implementation will sequentially generate ids for each field 
starting at `1000` based on its position in the list of fields.
+* For forward compatibility, if field ids are not supported but present in the 
metadata, old versions of the reference implementation will ignore those field 
ids and then regenerate an auto-increment field id starting at 1000 for every 
partition field.
+
+While working with a v1 table, the v1 partition spec does not require 
consistent field IDs and then they are assigned when creating each manifest 
file. 
+When creating a manifest, each field of the partition spec will be assigned an 
ID starting at `1000`, and there is no guarantees about ID reuse across files. 
+But as long as the partition spec will not be evolved, IDs will be consistent.
+
+This has a few implications:
+* Older writers may erase partition field IDs when writing to a v1 table. This 
does not happen to v2 tables because writers will fail to read or write a v2 
table.
+* Metadata tables need consistent field IDs across manifest files. To achieve 
it, for v1 tables, please evolve the partition spec according to the 
recommendations, 

Review comment:
       Nit: I wonder if we want to expand a little on "Metadata tables need 
consistent field IDs across manifest files". I think the point is to ensure 
correctness when reading metadata tables: the same ID always refers to the 
same partition field (no ID reuse), and the mapping between ID and partition 
field stays immutable even after partition spec evolution.
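
One way to state that invariant precisely is as a check over the ID mappings found in a set of manifests. The sketch below is illustrative only: `find_id_conflicts` and the per-manifest `{field_id: field}` dictionaries are hypothetical, and a field is identified here by a `transform(column)` string rather than by its name, since the diff notes that renames are fine.

```python
def find_id_conflicts(manifests):
    """Return the IDs that are bound to more than one partition field.

    `manifests` is a hypothetical list of {field_id: field} dicts, one per
    manifest file, where `field` is a string such as "day(ts)".
    """
    seen = {}
    conflicts = {}
    for mapping in manifests:
        for field_id, partition_field in mapping.items():
            # Remember the first field bound to this ID and compare later ones.
            first = seen.setdefault(field_id, partition_field)
            if first != partition_field:
                conflicts.setdefault(field_id, {first}).add(partition_field)
    return conflicts


# IDs stay bound to one field each, even though the spec evolved.
consistent = [
    {1000: "day(ts)", 1001: "identity(region)"},
    {1000: "day(ts)", 1002: "bucket[16](id)"},
]

# ID 1001 is reused for a different field after the spec evolved.
reused = [
    {1000: "day(ts)", 1001: "identity(region)"},
    {1000: "day(ts)", 1001: "bucket[16](id)"},
]

no_conflicts = find_id_conflicts(consistent)
with_conflicts = find_id_conflicts(reused)
```

A metadata table built over the `reused` manifests could not tell which field ID `1001` refers to, which is the correctness problem described above.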




