rdblue commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3024918968


##########
format/spec.md:
##########
@@ -123,6 +138,35 @@ Tables do not require random-access writes. Once written, 
data and metadata file
 
 Tables do not require rename, except for tables that use atomic rename to 
implement the commit operation for new metadata files.
 
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata files are classified as one of two 
types:
+
+* **Absolute path** -- A path string that includes a [URI 
scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., 
`s3://`, `gs://`, `hdfs://`, `file:///`). Absolute paths are used as-is without 
modification.
+* **Relative path** -- A path string that does not include a URI scheme. 
Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain absolute paths. Starting with v4, 
path fields may contain either absolute or relative paths. Directory navigation 
symbols (`.` and `..`) and other file system conventions are not supported in 
relative paths.
+
+#### Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative 
path by combining it with the table's base location. If a path is absolute, it 
is used as-is. If a path is relative, it is concatenated with the table 
location to produce an absolute path:
+
+* If the path contains a URI scheme, it is absolute and is used without 
modification.
+* If the path does not contain a URI scheme, the resolved path is the table 
location followed by the relative path.
+
+Paths used as prefixes must not end in a path separator. The relative portion 
is appended to the prefix without introduction of any additional separator 
characters.
+
+#### Path Relativization
+
+Path relativization is the process of converting an absolute path to a 
relative path by removing the table location prefix. This is used when 
persisting paths to metadata files.
+
+* If an absolute path starts with the table location, the table location 
prefix is removed and the remaining relative portion is stored.
+* If an absolute path does not start with the table location, it is stored as 
an absolute path.
+
+#### Table Location Specification
+
+When the `location` field is present in table metadata, it is used directly as 
the table's base location. When the `location` field is not present (v4 and 
later), the table location must be provided.  How the table location is 
persisted/determined when not specified in metadata is outside the scope of the 
spec and is left to implementations to define.

Review Comment:
   > It seems that there is no cheap way to know if all metadata paths are 
relative and thus can be moved/replicated without concern.
   
   Another cheap way is to look at the Parquet stats for each manifest file. If 
both the lower and upper bounds start with `/`, then you know every path is 
relative. I think that's a reasonable solution, and we need to be careful not 
to mix other concerns into how the top-level location is handled.
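
A minimal sketch of why the bounds check is enough, assuming the bounds are untruncated min/max values of the path column and that relative paths begin with `/` (the proposed text says the table location prefix must not end in a separator). The names `bounds` and `all_paths_relative` are hypothetical and are not part of any Iceberg or Parquet API. Every string that sorts between two strings starting with `/` must itself start with `/`, and a URI scheme always begins with a letter, which sorts after `/`.

```python
from typing import Iterable, Tuple


def bounds(paths: Iterable[str]) -> Tuple[str, str]:
    """Stand-in for column stats: the lexicographic min and max of the paths."""
    values = list(paths)
    return min(values), max(values)


def all_paths_relative(lower_bound: str, upper_bound: str) -> bool:
    """True when both bounds start with '/', so every value between them does too.

    Assumption: bounds are not truncated in a way that changes the first character.
    """
    return lower_bound.startswith("/") and upper_bound.startswith("/")


# Hypothetical manifest contents, for illustration only.
relative_only = ["/data/a.parquet", "/data/b.parquet", "/metadata/m0.avro"]
mixed = relative_only + ["s3://other-bucket/data/c.parquet"]

print(all_paths_relative(*bounds(relative_only)))
print(all_paths_relative(*bounds(mixed)))
```

The check reads only two values per manifest, so it stays cheap no matter how many entries the manifest holds.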



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
