rdblue commented on pull request #2354:
URL: https://github.com/apache/iceberg/pull/2354#issuecomment-809718836


   @openinx, I think that @jackye1995 is right about how the case you described 
would be encoded. The delete files themselves always encode which columns are 
used for the equality delete.
   
   There is no requirement that a delete file's delete columns match the 
table's row identifier fields. That's one reason why we can encode deletes 
right now, before we've added the row identifier tracking. That also enables 
deleting rows by different fields than the row identifier fields, which is what 
makes the evolution case possible.
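
   To illustrate, here is a minimal sketch of delete files that each carry their own equality columns. The names `EqualityDeleteFile` and `is_deleted` are hypothetical and not Iceberg APIs, rows are modeled as plain dicts keyed by field ID, and the sketch ignores sequence numbers and file formats entirely.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class EqualityDeleteFile:
    """Hypothetical stand-in for an equality delete file."""
    # Field IDs this particular delete file matches on.
    equality_field_ids: Tuple[int, ...]
    # Deleted keys; each key lists values in the order of equality_field_ids.
    deleted_keys: FrozenSet[Tuple[Any, ...]]


def is_deleted(row: Dict[int, Any], delete_files: Iterable[EqualityDeleteFile]) -> bool:
    """Return True if any delete file matches the row on that file's own columns."""
    for delete_file in delete_files:
        key = tuple(row[field_id] for field_id in delete_file.equality_field_ids)
        if key in delete_file.deleted_keys:
            return True
    return False


# Two delete files written with different delete columns can coexist.
delete_files = [
    EqualityDeleteFile((1,), frozenset({(10,)})),            # delete by field 1
    EqualityDeleteFile((2, 3), frozenset({("a", "us")})),    # delete by fields 2 and 3
]
```

   Because each file names its own columns, nothing in this check needs to consult the table's row identifier fields.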
   
   The row identifier fields relate to deletes in only one way: when an 
operation doesn't specify explicit delete columns, we can default the delete 
columns to the row identifier fields. That supports the `UPSERT` case, where 
we define the identity fields in table metadata rather than in the sink 
configuration.
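
   A rough sketch of that defaulting rule, assuming field IDs are plain integers. The function name `resolve_delete_field_ids` is hypothetical, and raising when neither source is available is my assumption rather than anything decided in this PR.

```python
from typing import FrozenSet, Iterable, Optional


def resolve_delete_field_ids(
    explicit_ids: Optional[Iterable[int]],
    identifier_field_ids: Iterable[int],
) -> FrozenSet[int]:
    """Pick the delete columns for an operation.

    Explicit delete columns win; otherwise fall back to the table's
    row identifier fields as read at the start of the operation.
    """
    explicit = frozenset(explicit_ids or ())
    if explicit:
        return explicit

    fallback = frozenset(identifier_field_ids)
    if not fallback:
        # Assumed behavior: without either source there is nothing to match on.
        raise ValueError("no explicit delete columns and no row identifier fields")
    return fallback
```

   In the `UPSERT` case the sink would pass no explicit columns, so the result is just the identifier fields from table metadata.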
   
   From @jackye1995's second comment, I think there is at least some agreement 
that the row identifier columns don't need to be tracked over time. That's 
because there is no way to go back to an older snapshot and then manipulate 
that data. Time travel is read-only and data manipulation is always applied to 
the current snapshot, so it is reasonable that there is only ever one version 
of the row identifier that matters: the one that is configured at the start of 
the operation.
   
   Before moving ahead with this, I think we should simplify it and remove the 
versioning.
   
   I'm also wondering about the field ordering mentioned in the code. Is that 
relevant? I think of the row identifier fields as unordered and simply used to 
produce a projection of the table schema that is a row identifier, in whatever 
field order the schema had. So I would model this as an unordered set of IDs 
rather than as an ordered collection.
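
   Roughly what I have in mind, as a sketch only: `Field` and `project_identifier` are hypothetical names, and the schema is simplified to a flat list of top-level fields.

```python
from dataclasses import dataclass
from typing import AbstractSet, List, Sequence


@dataclass(frozen=True)
class Field:
    """Hypothetical, simplified schema field."""
    field_id: int
    name: str


def project_identifier(schema: Sequence[Field], identifier_ids: AbstractSet[int]) -> List[Field]:
    """Project the schema down to the identifier fields.

    The IDs are an unordered set; the output order comes from the schema.
    """
    return [field for field in schema if field.field_id in identifier_ids]


schema = [Field(1, "id"), Field(2, "data"), Field(3, "region")]

# The same set of IDs gives the same projection, however the set was built.
projection = project_identifier(schema, {3, 1})
names = [field.name for field in projection]
```

   With this model there is no separate field order to store or keep in sync with the schema.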

