JarroVGIT commented on issue #12263:
URL: https://github.com/apache/iceberg/issues/12263#issuecomment-3697676433

   @RussellSpitzer : I agree, tagging has its shortcomings here, and my post 
can be considered more of a thought experiment than a well-thought-out 
proposal. It gets pretty far but breaks down quickly in less simple scenarios.
   
   With regard to what's missing in branching: I think it's primarily a matter 
of independent lifecycles. Tracking the schema id in the branch is definitely a 
step in the right direction, I think, but it would break the current feature 
(and, I assume, intended purpose) of branching: data inserted into a 
branch is validated against the schema of the table, which ensures a possible 
fast-forward to the main branch later on.
   
   But a branch is inherently always a child resource of a table, and not a 
descendant. This means that there will always be contention: if I have 50 
active branches (and yes, this is a real scenario[^1]), I suddenly have 50 
writers trying to update the table metadata concurrently. Cloning would create 
a new table that reuses the existing data files, so in this example each writer 
would write to its own metadata files with no contention. Access control is 
another aspect, but as someone stated earlier, that seems more like a catalog 
concern (and I even think it is already possible to implement). Other 
examples where branches interfere with one another and lack an 
independent lifecycle are:
   - `next-row-id` on the table is updated across branches.
   - history within a branch is mixed with the table history; if a branch is 
updated, the newest table metadata will show the new snapshot-id for the branch, 
but you must apply a two-step lookup to determine the previous snapshot-id of 
that branch (namely: look up the snapshot-id prior to the current snapshot-id 
of the branch, load that metadata and from there look up what the snapshot-id 
of the branch was at that point in time).
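   
   To make the two-step lookup concrete, here is a rough sketch of one way to 
read it. It is an illustration under assumptions, not how any Iceberg library 
does it: the metadata files are modelled as an in-memory dict (the file names 
and the `METADATA_FILES` and `previous_branch_snapshot` names are made up), 
and only the `refs` and `metadata-log` fields are kept, with the log assumed 
to list older metadata files oldest first.

```python
from typing import Optional

# Hypothetical stand-in for metadata files on storage; heavily simplified.
METADATA_FILES = {
    "v1.metadata.json": {
        "refs": {"main": {"snapshot-id": 100, "type": "branch"}},
        "metadata-log": [],
    },
    "v2.metadata.json": {  # branch "dev" is created
        "refs": {
            "main": {"snapshot-id": 100, "type": "branch"},
            "dev": {"snapshot-id": 200, "type": "branch"},
        },
        "metadata-log": [{"metadata-file": "v1.metadata.json"}],
    },
    "v3.metadata.json": {  # main advances, dev is untouched
        "refs": {
            "main": {"snapshot-id": 101, "type": "branch"},
            "dev": {"snapshot-id": 200, "type": "branch"},
        },
        "metadata-log": [
            {"metadata-file": "v1.metadata.json"},
            {"metadata-file": "v2.metadata.json"},
        ],
    },
    "v4.metadata.json": {  # dev advances
        "refs": {
            "main": {"snapshot-id": 101, "type": "branch"},
            "dev": {"snapshot-id": 201, "type": "branch"},
        },
        "metadata-log": [
            {"metadata-file": "v1.metadata.json"},
            {"metadata-file": "v2.metadata.json"},
            {"metadata-file": "v3.metadata.json"},
        ],
    },
}


def branch_snapshot(metadata: dict, branch: str) -> Optional[int]:
    """Return the snapshot-id a branch points to in one metadata file."""
    ref = metadata["refs"].get(branch)
    return ref["snapshot-id"] if ref else None


def previous_branch_snapshot(
    files: dict, current_file: str, branch: str
) -> Optional[int]:
    """Walk older metadata files, newest first, until the branch ref differs.

    Returns None if the branch did not exist before or never changed.
    """
    current = files[current_file]
    current_id = branch_snapshot(current, branch)
    for entry in reversed(current["metadata-log"]):
        older = files[entry["metadata-file"]]  # step 1: load older metadata
        older_id = branch_snapshot(older, branch)  # step 2: read the ref there
        if older_id != current_id:
            return older_id
    return None
```

   Calling `previous_branch_snapshot(METADATA_FILES, "v4.metadata.json", "dev")` 
has to open `v3.metadata.json` just to learn where `dev` pointed before, 
which is the extra hop I mean: the branch has no history of its own.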
   
   So, yeah, reading your first sentence again: 
   
   > Basically the introduction the catalog brings us back to the idea that we 
need to centralize the information about which tables own which snapshots in 
the same system.
   
   I think you are correct; this is hardly possible without a centralised 
system that tracks this information across tables. Do you think this would 
then be more suitable as an evolution of the REST spec? 
   
   
   [^1]: I have seen test suites that run several integration tests in parallel 
on clones of production tables, for example. Another example is a research 
department where clones are used for experimentation, where dozens of people 
work on their own clones of the same production table. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


