dramaticlly commented on code in PR #15630: URL: https://github.com/apache/iceberg/pull/15630#discussion_r3276397036
##########
format/spec.md:
##########
@@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with
 Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:
+
+* **Absolute path** -- A path string that starts with a [URI scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., `s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without modification.
+* **Relative path** -- A path string that does not start with a URI scheme. Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with v4, path fields may contain either absolute or relative paths. [Relative resolution within a URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and `..`) and other file system navigation conventions are not supported in relative paths.
+
+#### Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without modification.
+* If the path does not contain a URI scheme, the resolved path is the table location followed by the relative path joined by the URI separator character `/`.
+
+The relative portion is joined to the prefix (table location) without consideration of any additional separator characters. The recommended convention for table location is to not end in a path separator because the join process would add a second separator character. (See example below).
+
+Paths in manifests produced prior to v4 are fully-qualified and must be produced with a URI scheme if the scheme was omitted to be consistent with V4 paths.
+
+Examples of path resolution:
+
+|                     | Format Version | Table Location        | File Path                                 | Resolved Path                              | Description                         |
+|---------------------|----------------|-----------------------|-------------------------------------------|--------------------------------------------|-------------------------------------|
+| Relative Path       | v4             | s3://bucket/db/table  | data/00000-0.parquet                      | s3://bucket/db/table/data/00000-0.parquet  | Path parts are joined on `/`        |
+| Absolute Path       | v4             | s3://bucket/db/table  | hdfs://wh/db/table/data/00000-0.parquet   | hdfs://wh/db/table/data/00000-0.parquet    | Absolute path is used               |
+| Duplicate separator | v4             | s3://bucket/db/table/ | data/00000-0.parquet                      | s3://bucket/db/table//data/00000-0.parquet | Join results in duplicate `//`      |
+| Duplicate separator | v4             | s3://bucket/db/table  | /data/00000-0.parquet                     | s3://bucket/db/table//data/00000-0.parquet | Join results in duplicate `//`      |
+| Fully-qualified     | v3 and earlier | s3://bucket/db/table  | s3://bucket/db/table/data/00000-0.parquet | s3://bucket/db/table/data/00000-0.parquet  | Fully-qualified path is used        |
+| Missing scheme      | v3 and earlier | /wh/db/table          | /wh/db/table/data/00000-0.parquet         | hdfs://wh/db/table/data/00000-0.parquet    | Scheme is prepended for consistency |
+
+#### Path Relativization
+
+Path relativization is the process of converting an absolute path to a relative path by removing the table location prefix. This is used when persisting paths to metadata files.
+
+* If an absolute path starts with the table location immediately followed by a separator character, the relative path is the remainder of the string after the separator character.
+* If an absolute path does not start with the table location immediately followed by the separator character, it is stored as an absolute path.

Review Comment:
   nit: It might be helpful to explicitly highlight `stored as an absolute path without modification`

##########
format/spec.md:
##########
@@ -168,6 +184,48 @@ All columns must be written to data files even if they introduce redundancy with
 Writers are not allowed to commit files with a partition spec that contains a field with an unknown transform.
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one of two types:

Review Comment:
   Nit: There are a few references to "fully qualified path" later in the context of v3 and prior, without it being explicitly defined. Since we're classifying paths into two types below, it might be worth briefly noting that fully qualified paths from v3 and prior are considered absolute paths. This could help connect the dots more easily.

##########
format/spec.md:
##########
@@ -1647,6 +1733,30 @@ The binary single-value serialization can be used to store the lower and upper b
 ## Appendix E: Format version changes
+### Version 4
+
+Relative path support is added in v4.
+
+Reading v3 metadata for v4:

Review Comment:
   nit: v3 or prior metadata for v4
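
For reference, the resolution and relativization rules quoted in the first hunk can be sketched in a few lines. This is only an illustration of the proposed spec text as worded in this PR, not Iceberg's implementation: the function names `is_absolute`, `resolve`, and `relativize` are hypothetical, and the scheme check is a plain regex over the RFC 3986 section 3.1 scheme grammar. The v3 "missing scheme" row of the table is deliberately not covered, since the quoted text does not say where the prepended scheme comes from.

```python
import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ),
# terminated by ":". Hypothetical helper, not an Iceberg API.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def is_absolute(path: str) -> bool:
    """A path is absolute when it starts with a URI scheme."""
    return _SCHEME.match(path) is not None


def resolve(table_location: str, path: str) -> str:
    """Absolute paths pass through; relative paths are joined on '/'.

    The join is a plain concatenation, so a trailing '/' on the table
    location or a leading '/' on the path yields a duplicate '//'.
    """
    if is_absolute(path):
        return path
    return table_location + "/" + path


def relativize(table_location: str, path: str) -> str:
    """Strip 'table_location/' when it is a prefix; otherwise keep the path."""
    prefix = table_location + "/"
    if path.startswith(prefix):
        return path[len(prefix):]
    # Not under the table location: stored as an absolute path, unmodified.
    return path
```

With a table location of `s3://bucket/db/table`, `relativize` leaves `hdfs://wh/db/table/data/00000-0.parquet` untouched, which is the "without modification" behavior the first comment asks the spec to spell out.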
