stevenzwu commented on code in PR #15630:
URL: https://github.com/apache/iceberg/pull/15630#discussion_r3251040825


##########
format/spec.md:
##########
@@ -168,6 +185,46 @@ All columns must be written to data files even if they 
introduce redundancy with
 
 Writers are not allowed to commit files with a partition spec that contains a 
field with an unknown transform.
 
+### Paths in Metadata
+
+Path strings stored in Iceberg metadata location fields are classified as one 
of two types:
+
+* **Absolute path** -- A path string that includes a [URI 
scheme](https://datatracker.ietf.org/doc/html/rfc3986#section-3.1) (e.g., 
`s3:`, `gs:`, `hdfs:`, `file:`). Absolute paths are used as-is without 
modification.
+* **Relative path** -- A path string that does not include a URI scheme. 
Relative paths must be resolved against the table's base location before use.
+
+Prior to v4, all path fields must contain fully-qualified paths. Starting with 
v4, path fields may contain either absolute or relative paths. [Relative 
resolution within a 
URI](https://datatracker.ietf.org/doc/html/rfc3986#section-5.2) (e.g. `.` and 
`..`) and other file system navigation conventions are not supported in 
relative paths.
+
+#### Path Resolution
+
+Path resolution is the process of producing an absolute path from a relative 
path by combining it with the table's base location:
+
+* If the path contains a URI scheme, it is absolute and is used without 
modification.
+* If the path does not contain a URI scheme, the resolved path is the table 
location followed by the relative path joined by the URI separator character 
`/`.
+
+Paths used as prefixes should not end in a path separator. The relative 
portion is joined to the prefix without consideration of any additional 
separator characters.
+
+Any path from a manifest produced prior to v4 is a fully-qualified path and 
must be produced with a URI scheme if the scheme was omitted to be consistent 
with V4 paths.
+
+Examples of path resolution:
+
+|                 | Format Version | Table Location       | File Path          
                      | Resolved Path                             | Description 
|
+|-----------------|----------------|----------------------|------------------------------------------|-------------------------------------------|-------------|
+| Relative Path   | v4             | s3://bucket/db/table | 
data/00000-0.parquet                     | 
s3://bucket/db/table/data/00000-0.parquet | Path parts are joined on `/` |
+| Absolute Path   | v4             | s3://bucket/db/table | 
hdfs:/wh/db/table/data/00000-0.parquet   | 
hdfs://wh/db/table/data/00000-0.parquet   | Absolute path is used |
+| Fully-qualified | v3 and earlier | s3://bucket/db/table | 
s3://bucket/db/table/data/00000-0.parquet | 
s3://bucket/db/table/data/00000-0.parquet | Fully-qualified path is used |
+| Missing scheme  | v3 and earlier | /wh/db/table         | 
/wh/db/table/data/00000-0.parquet        | 
hdfs:/wh/db/table/data/00000-0.parquet    | Scheme is prepended for consistency 
|
+
+#### Path Relativization
+
+Path relativization is the process of converting an absolute path to a 
relative path by removing the table location prefix. This is used when 
persisting paths to metadata files.
+
+* If an absolute path starts with the table location, the table location 
prefix should be removed along with the separator character and the remaining 
relative portion stored.
+* If an absolute path does not start with the table location, it is stored as 
an absolute path.
+
+#### Table Location Specification
+
+When the `location` field is present in table metadata, it is used directly as 
the table's base location. When the `location` field is not present (v4 and 
later), the table location must be provided.  How the table location is 
persisted or determined when not specified in metadata is not a table-level 
concern; catalogs should and provide a table's location

Review Comment:
   Two issues in this paragraph:
   
   1. **Grammatical error**: "catalogs should and provide a table's location" 
is missing a verb (e.g. "manage and provide"), and the sentence has no closing 
period.
   2. **Implicit actor**: "the table location must be provided" doesn't say by 
whom, then the next clause silently assumes the catalog.
   
   Suggested rewrite:
   
   > When the `location` field is present in table metadata, it is used 
directly as the table's base location. When the `location` field is not present 
(v4 and later), the catalog must provide the table location. How the catalog 
persists or determines a table's location is outside the scope of this spec.
   
   Changes:
   
   - Names the catalog as the actor in the second sentence.
   - Folds "catalogs should and provide a table's location" into the new third 
sentence, replacing the broken clause and adding the missing period.
   - "is not a table-level concern" → "is outside the scope of this spec" — 
clearer about what the spec is leaving unspecified.
   



##########
format/spec.md:
##########
@@ -1647,6 +1732,30 @@ The binary single-value serialization can be used to 
store the lower and upper b
 
 ## Appendix E: Format version changes
 
+### Version 4
+
+Relative path support is added in v4.
+
+Reading v3 metadata for v4:
+
+* All location fields are fully-qualified paths and interpreted as absolute 
paths for v4
+* Any location field without a uri scheme prefix must prepend a scheme 
component consistent with v4 absolute paths
+
+Writing v4 metadata:
+
+* Table metadata JSON:
+    * `location` is now optional and must be absolute when present
+    * When not present, the table location must be managed externally and 
provided when loading the metadata
+* Location fields in all metadata structures may contain relative paths
+* Writers should produce relative paths by default for files that reside under 
the table location
+* Absolute paths must be used for files that do not share a common prefix with 
the table location

Review Comment:
   Same "shares a common prefix" ambiguity flagged in the inline at line 1905. 
Line 1750 above ("files that reside under the table location") is also informal 
terminology for what is now formally defined in [Path 
Relativization](#path-relativization).
   
   Suggested rewording for lines 1750–1751:
   
   > * Writers should produce relative paths by default for files whose 
absolute path starts with the table location followed by the URI separator `/`.
   > * Absolute paths must be used for all other files.
   
   This way the writer rule is exactly the Path Relativization rule applied — 
no chance of producing a relative form that doesn't round-trip safely under 
table relocation.
   



##########
format/spec.md:
##########
@@ -1777,6 +1886,24 @@ Note that these requirements apply when writing data to 
a v2 table. Tables that
 
 This section covers topics not required by the specification but 
recommendations for systems implementing the Iceberg specification to help 
maintain a uniform experience.
 
+### Path Construction
+
+Path construction is the process by which new file locations are created for 
output files referenced by metadata. While the specific construction logic is 
not strictly required by the spec, the following guidance is provided for 
reference implementations to encourage consistency.
+
+The table properties `write.metadata.path` and `write.data.path` control where 
metadata and data files are written relative to the table location. When not 
specified, these default to the values `metadata` and `data` respectively.

Review Comment:
   This sentence contradicts the bullets directly below. The opening says the 
properties control where files are written **"relative to the table 
location,"** but the bullets at lines 1897 and 1902 explicitly handle the case 
where `write.metadata.path` / `write.data.path` is an **absolute** path that's 
used directly, bypassing the table location.
   
   Suggested rewrite:
   
   > The table properties `write.metadata.path` and `write.data.path` control 
where metadata and data files are written. Each property may be an absolute 
path or a relative path; relative paths are resolved against the table 
location. When not specified, these default to the relative values `metadata` 
and `data`.
   
   This removes the contradiction and surfaces the absolute/relative 
distinction upfront so the bullets that follow read as enumeration rather than 
exceptions.
   



##########
format/spec.md:
##########
@@ -1647,6 +1732,30 @@ The binary single-value serialization can be used to 
store the lower and upper b
 
 ## Appendix E: Format version changes
 
+### Version 4
+
+Relative path support is added in v4.

Review Comment:
   I feel we probably need to introduce sub headers for Version 4, as we have 
so many changes scoped for V4. if each feature has multiple paragraphs, it is 
kind to track.
   
   The alternative is that each feature only has bullet points. E.g. the row 
lineage in V3 section below has a large bullet list.
   
   I am slightly favor the former.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to