mxm commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3022410921
########## format/spec.md: ########## @@ -123,6 +139,35 @@ Tables do not require random-access writes. Once written, data and metadata file Tables do not require rename, except for tables that use atomic rename to implement the commit operation for new metadata files. +### Paths in Metadata + +Path strings stored in Iceberg metadata files are classified as one of two types: + +* **Absolute path** -- A path string that includes a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3://`, `gs://`, `hdfs://`, `file:///`). Absolute paths are used as-is without modification. +* **Relative path** -- A path string that does not include a URI scheme. Relative paths must be resolved against the table's base location before use. + +Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. Directory navigation symbols (`.` and `..`) and other file system conventions are not supported in relative paths. Review Comment: This issue applies to paths in general, not only to the new relative paths. Today, absolute paths could contain "directory navigation symbols", if the corresponding FileIO treats them like it. I think we could state that paths (relative or absolute) must be normalized: ```suggestion Prior to v4, all path fields must contain absolute paths. Starting with v4, path fields may contain either absolute or relative paths. All paths must be normalized, i.e. special symbols (`.`, `..`, etc.) must be resolved before storing the path. ``` +1 for leaving it up to the FileIO to enforce this. ########## format/spec.md: ########## @@ -1637,6 +1683,31 @@ The binary single-value serialization can be used to store the lower and upper b ## Appendix E: Format version changes +### Version 4 + +Relative path support is added in v4. + +Reading v3 metadata for v4: + +* All path fields are treated as absolute paths +* Table metadata field `location` is always present + +Writing v4 metadata: + +* Table metadata JSON: + * `location` is now optional + * When present, `location` must be an absolute path + * When not present, the table location must be managed externally and provided when loading the metadata +* Path fields in all metadata structures may contain relative paths +* Writers should produce relative paths by default for files that reside under the table location +* Absolute paths must be used for files that do not share a common prefix with the table location + +Reading v4 metadata: + +* Readers must check whether path fields contain a URI scheme to determine if a path is absolute or relative +* Relative paths must be resolved against the table location before use (see [Path Resolution](#path-resolution)) +* When `location` is omitted, the table location must be provided (see [Table Location Specification](#table-location-specification) Review Comment: ```suggestion * When `location` is omitted, the table location must be provided (see [Table Location Specification](#table-location-specification)) ``` NIT missing parenthesis -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
