[GitHub] [iceberg] jackye1995 edited a comment on pull request #2465: Core: add row identifier to schema

GitBox Wed, 21 Apr 2021 17:08:28 -0700


jackye1995 edited a comment on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-824442288



   @rdblue @openinx sorry for the delayed update, I tried multiple different 
ways to implement this and went back and forth, and in the end the cleanest one 
I found was to use name to record all the identifier fields set, and at 
`applyChange` time, deletion of identifier column is checked before generating 
the new schema struct to validate deletion of identifier field, other 
validations are done after the new schema struct is generated.
   
   Ryan talked about using 2 sets, one for existing IDs and one for new names. 
The issue with that approach was that the validation checks done for the 2 sets 
are very similar. For existing IDs, I need to check recursively that all the 
parents of the field up to root are struct, which can be done using 
`idToParent` index. For new names, I need to do something very similar, but 
also merge the `IdToParent` with the parent information of newly added fields. 
So in the end the code was very similar to just creating a new schema and 
recreating the `IdToParent` index and a bit redundant. Combining those two 
facts, I think it is much cleaner to just store everything as name and then 
validate name and resolve back to ID after the new schema struct is generated.
   
   Ryan also talked about building an index of added name to ID when adding new 
columns, and with this approach that is not needed because the new schema would 
have that index when calling `findField`. I think that is a win to not increase 
complexity in `addColumn` and avoid new indexes in `UpdateSchema`.
   
   I can also do the `deletes` check in the validation function, but in the end 
I chose to do it in a separated code block before generating the schema struct, 
because doing it in the validation method requires passing in many arguments, 
and that flow was not intuitive to code readers, it only increased the 
complexity of the code and reduced readability.
   
   There are some other methods I tried to implement this whole thing, such as 
filtering all potential identifier fields in schema and then check if the 
identifiers are in that potential set. But that approach could not produce 
informative error message so I did not go with that route.
   
   I have added some more tests to cover all possible use cases, please let me 
know if you have any additional concern with the approach.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] jackye1995 edited a comment on pull request #2465: Core: add row identifier to schema

Reply via email to