RussellSpitzer commented on issue #12263: URL: https://github.com/apache/iceberg/issues/12263#issuecomment-3699764343
> > With regards to what's missing in branching; I think it's primarily a matter of independent lifecycles. Tracking the schema id in the branch is definitely a step in the right direction I think, but it would break the current feature (and I assume, intended purpose) of branching (mainly; inserted data in a branch is validated against the schema of the table, ensuring a possible fastforward to the main branch later on).

I don't see why we couldn't fast forward with different schemas for certain schema changes. Obviously some are impossible, but some branch modifications are also not mergeable.

> > But a branch is inherently always a child resource of a table, and not a descendant. This means that there will always be a contention; if I have 50 active branches (and yes, this is a real scenario[1]) I suddenly have 50 writers trying to update the table metadata concurrently.
> Cloning would create a new table, but with existing data files, so in this example, all writers would write to their own metadata files with no contention. Access control is another aspect, but as stated earlier by someone that seems more like a catalog concern (and I even think that is already possible to implement).

This is where I keep having a problem: "child resource" and "descendant" are the same in my mind, and the "clone" is always tied to the source regardless of what we call it. The "clone" uses the same data files, and the way we track data files is through metadata, so our "clone" would always need to check the parent metadata before cleaning up any data file (and vice versa).

> Cloning would create a new table, but with existing data files, so in this example, all writers would write to their own metadata files with no contention.

As I noted, the only way to eliminate this contention is to disconnect the metadata ties, which breaks data file ownership. I guess you could do cleanup out of band with all other table operations, but again that's far outside the scope of the table, so it would probably have to be tracked by the catalog. (I'm actually not sure you can do this without the catalog expressly knowing all the files in all tables; I think it would have to either do reference counting or check for unreferenced files in an out-of-band cleanup operation.)

> Access control is another aspect, but as stated earlier by someone that seems more like a catalog concern (and I even think that is already possible to implement).

Yep, this one would be done by just not vending credentials to files in the "source" table or something like that, or by identifiers with special mappings to branches. There are definitely paths forward here without a new API.

> * `next-row-id` on the table is updated across branches.

Not sure why this would actually matter in practice.

> * history within a branch is mixed with the table history; if a branch is updated, the newest table metadata will show the new snapshot-id as the branch, but you must apply a two-step lookup to determine the previous snapshot-id of that branch (mainly: look up the snapshot-id prior to the current snapshot-id of the branch, load that metadata and from there look up what the snapshot-id of the branch was at that point in time).

I'm not sure why this is an issue either; don't we already use parent-id to find the ancestor of snapshots?

-----

I'm not trying to shut down this idea, I'm just trying to point out that we need a really good justification as to why the branching approach can't meet the demand. Currently it seems to me that every time we start moving down this path we get to a point where we essentially need a completely different commit/snapshot tracking mechanism than the one that Iceberg has right now, and to me that's a heavy lift.

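To make the parent-id remark above concrete, here is a minimal sketch of recovering a branch's history from a single metadata file by following parent pointers. It assumes the table metadata has already been parsed into plain Python dicts using the spec field names `snapshot-id`, `parent-snapshot-id`, and `refs`; the helper name `branch_history` and the sample metadata are hypothetical and not part of any Iceberg library.

```python
from typing import Dict, List, Optional


def branch_history(snapshots: List[dict], refs: Dict[str, dict], branch: str) -> List[int]:
    """Return the snapshot ids reachable from a branch head, newest first."""
    # Index snapshots by id so each parent lookup is O(1).
    by_id = {s["snapshot-id"]: s for s in snapshots}
    current: Optional[int] = refs[branch]["snapshot-id"]
    history: List[int] = []
    # Stop at the root snapshot, or when the parent is no longer present
    # in this metadata (for example because it was expired).
    while current is not None and current in by_id:
        history.append(current)
        current = by_id[current].get("parent-snapshot-id")
    return history


# Hypothetical metadata: main is 1 -> 2, branch "audit" forks at 1 with 3 -> 4.
metadata = {
    "snapshots": [
        {"snapshot-id": 1},
        {"snapshot-id": 2, "parent-snapshot-id": 1},
        {"snapshot-id": 3, "parent-snapshot-id": 1},
        {"snapshot-id": 4, "parent-snapshot-id": 3},
    ],
    "refs": {
        "main": {"snapshot-id": 2, "type": "branch"},
        "audit": {"snapshot-id": 4, "type": "branch"},
    },
}

audit_history = branch_history(metadata["snapshots"], metadata["refs"], "audit")
main_history = branch_history(metadata["snapshots"], metadata["refs"], "main")
```

Under these assumptions the previous head of `audit` is the second entry of `audit_history`, found without loading any older metadata file. The walk only sees snapshots still retained in the current metadata.
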
I'm trying to make sure that if we do design a new mechanism, we are getting something really good out of it that we can't get out of existing structures. I think we should work on the key requirements. So far the only ones that I kind of buy are:

* "Schema needs to be independent"
* "We want commits not to conflict"

As I mentioned, I think those are both fixable within the current protocol. After all, within an Iceberg REST commit there is no reason why a commit to a branch should *ever* conflict with a commit to another branch, and the only reason they do in the old catalog protocol is because we rewrite metadata.json optimistically each time. This shouldn't be the case in a REST Catalog scenario.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
