Re: Relative paths and location resolution

Yufei Gu Tue, 02 Jun 2026 09:26:56 -0700

This sounds like a slightly different problem to me. Sharing more context,
the rewrite_table_path procedure[1] already has machinery to reason about
the delta between snapshots and prepare a file copy plan for table
replication. That seems closer to the granularity Samuel is asking for than
path resolution itself.


I wonder if the better solution is to have an additional replication tool
that tracks which snapshots or sequence ranges have been copied to which
locations, similar in spirit to rewrite_table_path. That tool could own the
replication state and produce copy plans or region-specific table
locations. Then relative path resolution can remain simple, while more
advanced replication logic stays outside the core FileIO path resolution
layer.
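
As a rough sketch of that split (not an existing Iceberg API; the class,
method, and region names below are hypothetical, and the storage layout is
assumed), the tool could record which sequence numbers have been copied to
each region, derive a copy plan from that, and hand out a region-specific
table location. Path resolution itself then stays a plain join of table
location and relative path:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ReplicationState:
    """Hypothetical tracker of which sequence numbers were copied where."""

    # Region name -> table location in that region (assumed layout).
    locations: Dict[str, str]
    # Region name -> sequence numbers already fully copied to that region.
    copied: Dict[str, Set[int]] = field(default_factory=dict)

    def mark_copied(self, region: str, sequence_number: int) -> None:
        """Record that every file of a sequence number now exists in region."""
        if region not in self.locations:
            raise KeyError(f"unknown region: {region}")
        self.copied.setdefault(region, set()).add(sequence_number)

    def copy_plan(self, region: str, committed: List[int]) -> List[int]:
        """Return committed sequence numbers not yet copied to region."""
        done = self.copied.get(region, set())
        return [seq for seq in committed if seq not in done]

    def table_location(self, region: str, sequence_number: int,
                       fallback_region: str) -> str:
        """Pick the region's location if copied there, else the fallback's."""
        if sequence_number in self.copied.get(region, set()):
            return self.locations[region]
        return self.locations[fallback_region]


def resolve(table_location: str, relative_path: str) -> str:
    """Plain relative-path resolution: join table location and path."""
    return table_location.rstrip("/") + "/" + relative_path.lstrip("/")
```

A reader in one region would ask the tracker for a table location for the
sequence number it is reading and pass the result to resolve(), so the
routing decision never has to reach FileIO.newInputFile(String path).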

1.
https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_table_path

Yufei


On Tue, Jun 2, 2026 at 8:51 AM samuel pacheco cantu via dev <
[email protected]> wrote:

> To give more details: yes, the idea is that at the moment of staging a
> commit, we can create a directory under warehouse_location/commit_id/ that
> contains all data and metadata for that commit. Once committed, we can
> start replicating that directory.  Each commit_id would map to a
> sequence-id.
>
> With this design, it's possible to partially replicate commits across
> multiple regions by using a bitmap to determine if sequence-id N has
> already been replicated to region X.
>
> Regarding the writer path construction, yeah, I see the LocationProvider is
> in charge of creating the path. My question is about the FileIO input
> files: does it make sense to keep the metadata as relative paths (or
> absolute paths), and simply swap the files we read when getting the input
> files? This might be a hacky and inelegant solution.  I'm reaching out to
> understand which layer will be in charge of resolving the paths.
>
> The current relative path solution sounds like it would add checks within
> the parsing code to see if the URI is relative, and then join the
> table_location with the relative_path if applicable. I'm curious if, at the
> very least, it would make sense to make it extendable by considering more
> metadata.  For our use case, we are exploring the sequence ID as part of
> the routing.
>
> On Mon, Jun 1, 2026 at 5:22 PM Steven Wu <[email protected]> wrote:
>
>> It seems that Sam wants a table to hold data files from multiple live
>> regions (prefixes). The current design only supports a single prefix.
>>
>> On Mon, Jun 1, 2026 at 3:19 PM Daniel Weeks <[email protected]> wrote:
>>
>>> Hey Sam,
>>>
>>> I'm not sure I fully understand the scenario you're describing, but
>>> the basic concept of relative paths is that you have a table location
>>> (provided by a catalog) and files are resolved relative to that table
>>> location.
>>>
>>> Some examples are provided in the spec
>>> <https://iceberg.apache.org/spec/#path-resolution>.
>>>
>>>
>>>    1. What is the intended use case for relative paths in Iceberg? Is it
>>>    designed primarily for DR/replication scenarios?  What about real-time
>>>    replication?
>>>
>>> The design accommodates DR/replication with proper catalog
>>> implementations to route or provide the table location.  The act of
>>> replicating the files is left out of the spec, but can be realtime
>>> depending on the implementation.
>>>
>>>    2. At what point can a manifest or data file's relative path be
>>>    resolved to an absolute path? Does the current design assume all
>>>    referenced data is already available locally?
>>>
>>> Paths are resolved when they're read out of manifests.  If you have a
>>> reference in metadata to a file, it should exist or readers will fail when
>>> fetching the file.  By the time you perform a commit operation, it must be
>>> referenceable.
>>>
>>>    3. In FileIO, newInputFile(String path) takes a raw path string. Is
>>>    there a planned mechanism to provide additional metadata (like sequence
>>>    context) to help resolve paths in more complex topologies?
>>>
>>> A writer can construct paths in any way they want. Reference
>>> implementation behaviors are described in the appendix section, but there's
>>> no requirement for how they're constructed.  Relative path support is still
>>> being added to the reference implementation, but path construction is
>>> largely the responsibility of LocationProvider.  The path logic focuses on
>>> resolving or relativizing paths, not constructing them.
>>>
>>> -Dan
>>>
>>>
>>> On Mon, Jun 1, 2026 at 12:28 PM samuel pacheco cantu via dev <
>>> [email protected]> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I have a question about relative-path resolution in the context of
>>>> multi-region replication.
>>>>
>>>> *Context:* We have a use case where data files may reside in different
>>>> storage locations depending on the replication state. To resolve a relative
>>>> path, we'd need additional context (e.g., the commit's sequence-id) to
>>>> determine which region/scheme a given file should resolve to.
>>>>
>>>> We are actually thinking about swapping the absolute path scheme while
>>>> we wait for relative-path support.  We plan to do this at the FileIO layer
>>>> when requesting new input files.
>>>>
>>>> The problem we've hit with doing the swap at the FileIO layer is that
>>>> there are raw string path calls without any context for making a routing
>>>> decision.  I would expect the same problem to occur here for relative
>>>> paths, where there isn't enough context to determine the scheme.  The
>>>> same argument can be made that we require even more metadata to support
>>>> more complicated use cases, such as sequence-id (and/or
>>>> data-sequence-id).
>>>>
>>>>
>>>> *Questions:*
>>>>
>>>>    1. What is the intended use case for relative paths in Iceberg? Is it
>>>>    designed primarily for DR/replication scenarios?  What about real-time
>>>>    replication?
>>>>    2. At what point can a manifest or data file's relative path be
>>>>    resolved to an absolute path? Does the current design assume all
>>>>    referenced data is already available locally?
>>>>    3. In FileIO, newInputFile(String path) takes a raw path string. Is
>>>>    there a planned mechanism to provide additional metadata (like sequence
>>>>    context) to help resolve paths in more complex topologies?
>>>>
>>>> We'd like to understand Iceberg's direction on relative-path resolution
>>>> so we can align our approach with the community rather than diverging.
>>>>
>>>>
>>>> Thanks,
>>>> Sam
>>>>
>>>>
