rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-808606039


   @jackye1995 and @openinx, I have a few questions about this before I'm 
comfortable merging it. Thanks for working on this so far!
   
   Why do we need to track multiple versions of the row identifier like we do 
for schema, partition spec, and sort order? I think of this as the "fields that 
identify a row". Is it helpful to have more than one view of how rows are 
identified?
   
   To answer that, we need to consider whether two versions are ever valid at 
the same time, and how row IDs are going to evolve over time:
   * Row identifier columns may be set, either to initialize the identifier or 
to fix a mistake (e.g., account_id was used instead of profile_id)
   * Row identifier columns may be added when a new identifying column is 
added to the schema (e.g., adding profile_id to a table previously identified 
only by account_id)
   
   I think both of those operations only require setting the current way of 
identifying rows, not keeping track of the previous ways. I'm interested to 
hear what everyone thinks about that and whether there is agreement.
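
   As a rough illustration of that point, here is a minimal sketch in which 
table metadata stores only the current identifier. `TableMetadata`, 
`set_row_identifier`, and `add_identifier_column` are hypothetical names made 
up for this example, not classes or methods from this PR or from Iceberg.

```python
from dataclasses import dataclass


@dataclass
class TableMetadata:
    """Hypothetical stand-in for table metadata; not Iceberg's actual class."""

    columns: set[str]
    # Only the current identifier is stored; earlier values are not kept.
    row_identifier: frozenset[str] = frozenset()

    def set_row_identifier(self, *names: str) -> None:
        """Replace the identifier, either to initialize it or to fix a mistake."""
        missing = set(names) - self.columns
        if missing:
            raise ValueError(f"unknown columns: {sorted(missing)}")
        self.row_identifier = frozenset(names)

    def add_identifier_column(self, name: str) -> None:
        """Add a new column to the schema and extend the identifier with it."""
        self.columns.add(name)
        self.row_identifier = self.row_identifier | {name}


# Case 1: fix a mistake (account_id was used instead of profile_id).
fixed = TableMetadata(columns={"account_id", "profile_id"})
fixed.set_row_identifier("account_id")
fixed.set_row_identifier("profile_id")

# Case 2: a new identifying column is added to the schema.
extended = TableMetadata(columns={"account_id"})
extended.set_row_identifier("account_id")
extended.add_identifier_column("profile_id")
```

   In both cases the operation overwrites a single field, and nothing in the 
sketch ever reads an earlier identifier.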
   
   If I'm correct, then I would probably not keep track of multiple versions 
here. If I'm not, then I think we should ask whether the row ID columns should 
be tracked in the schema itself rather than separately versioned, since they 
will probably change at the same time the schema does -- when adding a new 
column that is now part of the identifier.
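
   For comparison, the alternative could look roughly like the sketch below, 
where the identifier is a property of each schema version instead of a 
separately versioned structure. `Schema`, `identifier_columns`, and 
`with_identifier_column` are hypothetical names for this example only, and the 
sequential `schema_id` is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical schema version that carries its own row identifier."""

    schema_id: int
    columns: tuple[str, ...]
    identifier_columns: tuple[str, ...] = ()


def with_identifier_column(schema: Schema, name: str) -> Schema:
    """Return a new schema version that adds `name` as an identifying column."""
    return Schema(
        schema_id=schema.schema_id + 1,  # assumption: ids increase by one
        columns=schema.columns + (name,),
        identifier_columns=schema.identifier_columns + (name,),
    )


v1 = Schema(schema_id=1, columns=("account_id", "name"),
            identifier_columns=("account_id",))
v2 = with_identifier_column(v1, "profile_id")
```

   Here the identifier history comes along with the schema history: `v1` 
still says rows are identified by account_id, and `v2` adds profile_id in the 
same change that adds the column.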
   
   It would be great to hear from @aokolnychyi on this as well.




