rdblue commented on pull request #1947: URL: https://github.com/apache/iceberg/pull/1947#issuecomment-747819953
I agree with all of the points that @aokolnychyi brought up. I also have a few suggestions on how to do this more cleanly. > We should think about how to group data before writing. Please take a look at #1955. That exposes `_pos` so that we can use it. I would suggest the following: * If the table has a sort order, add a global sort of all the rows produced by the merge node. * If the table does not have a sort order, then add a default sort by `_file`, `_pos`, partition columns, and the `MERGE` condition's references. That way, existing rows that have non-null `_file` and `_pos` will mostly preserve the sort order in the source data files (except for changed columns). Inserted rows will have nulls for `_file` and `_pos` and will then be sorted by partition columns to minimize the number of output files, and then by columns from the merge condition because those are likely to be high cardinality (giving Spark the ability to split partitions). > We should align the assignments according to the target columns. This also applies to UPDATE. Agreed. I think that we should have a rule similar to the logic in [`TableOutputResolver`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L30). One branch would modify `UPDATE` and add expressions to pull unchanged column values from the existing row. Another branch would modify `INSERT` by rearranging the columns by name for `INSERT (_names_) VALUES (_vals_)`. And we should also have a `MergeOutputCheck` rule to assert that each `INSERT` or `UPDATE` action is aligned with the output of the merge node. Modifying and checking the logical plan in the analyzer like this will require analyzer rules and a logical plan that doesn't contain `MergeIntoProcessor`. We will need a logical plan that the normal analysis rules can run on. Then we can hopefully remove the resolution from the rewrite as well. > We will need more tests Definitely. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
