jackye1995 edited a comment on pull request #2465: URL: https://github.com/apache/iceberg/pull/2465#issuecomment-824442288
@rdblue @openinx sorry for the delayed update, I tried multiple different ways to implement this and went back and forth, and in the end the cleanest one I found was to use name to record all the identifier fields set, and at `applyChange` time, deletion of identifier column is checked before generating the new schema struct to validate deletion of identifier field, other validations are done after the new schema struct is generated. Ryan talked about using 2 sets, one for existing IDs and one for new names. The issue with that approach was that the validation checks done for the 2 sets are very similar. For existing IDs, I need to check recursively that all the parents of the field up to root are struct, which can be done using `idToParent` index. For new names, I need to do something very similar, but also merge the `IdToParent` with the parent information of newly added fields. So in the end the code was very similar to just creating a new schema and recreating the `IdToParent` index and a bit redundant. Combining those two facts, I think it is much cleaner to just store everything as name and then validate name and resolve back to ID after the new schema struct is generated. Ryan also talked about building an index of added name to ID when adding new columns, and with this approach that is not needed because the new schema would have that index when calling `findField`. I think that is a win to not increase complexity in `addColumn` and avoid new indexes in `UpdateSchema`. I can also do the `deletes` check in the validation function, but in the end I chose to do it in a separated code block before generating the schema struct, because doing it in the validation method requires passing in many arguments, and that flow was not intuitive to code readers, it only increased the complexity of the code and reduced readability. There are some other methods I tried to implement this whole thing, such as filtering all potential identifier fields in schema and then check if the identifiers are in that potential set. But that approach could not produce informative error message so I did not go with that route. I have added some more tests to cover all possible use cases, please let me know if you have any additional concern with the approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
