rdblue commented on pull request #1177: URL: https://github.com/apache/iceberg/pull/1177#issuecomment-668126179
@fbocse, @rdsr, while I was looking at this, I prototyped a way to avoid a couple of the issues that I pointed out with this. I've posted that [union prototype](https://gist.github.com/rdblue/1d39763be41b59b8637d50162f7fe2ea) as a gist. The gist includes a `SchemaWithPartnerVisitor` pattern that is like the `SchemaWithTypeVisitor`. It handles the logic of traversing two schemas at the same time and passing "partner" objects along with the schema that is being traversed. The partner objects are accessed using a small set of methods, like `fieldPartner` to access a field's partner from a struct's partner. The gist also includes an implementation, `UnionByNameVisitor`. That visitor extends `SchemaWithPartnerVisitor`, where the partner is the equivalent field ID in the "existing" schema. By passing the existing schema's IDs, we can avoid keeping track of the ancestor names in the visitor. Instead, the prototype looks up the name in the existing schema from the ID when it needs to call `addColumn` with a parent name. That avoids all of the string logic and the need for `beforeField` callbacks. It also implements my suggestion to return a boolean to indicate whether a type is missing. See what you think about that prototype. I think it's good to have better separation between the traversal logic and the union logic, it is nice to not need to maintain the ancestor list, and it is a bit cleaner when detecting what changes need to be applied. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
