rdblue commented on pull request #1947:
URL: https://github.com/apache/iceberg/pull/1947#issuecomment-747819953


   I agree with all of the points that @aokolnychyi brought up. I also have a 
few suggestions on how to do this more cleanly.
   
   > We should think about how to group data before writing.
   
   Please take a look at #1955. That exposes `_pos` so that we can use it. I 
would suggest the following:
   * If the table has a sort order, add a global sort of all the rows produced 
by the merge node.
   * If the table does not have a sort order, then add a default sort by 
`_file`, `_pos`, partition columns, and the `MERGE` condition's references. 
That way, existing rows that have non-null `_file` and `_pos` will mostly 
preserve the sort order in the source data files (except for changed columns). 
Inserted rows will have nulls for `_file` and `_pos` and will then be sorted by 
partition columns to minimize the number of output files, and then by columns 
from the merge condition because those are likely to be high cardinality 
(giving Spark the ability to split partitions).
   
   > We should align the assignments according to the target columns. This also 
applies to UPDATE.
   
   Agreed. I think that we should have a rule similar to the logic in 
[`TableOutputResolver`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L30).
 One branch would modify `UPDATE` and add expressions to pull unchanged column 
values from the existing row. Another branch would modify `INSERT` by 
rearranging the columns by name for `INSERT (_names_) VALUES (_vals_)`.
   
   And we should also have a `MergeOutputCheck` rule to assert that each 
`INSERT` or `UPDATE` action is aligned with the output of the merge node.
   
   Modifying and checking the logical plan in the analyzer like this will 
require analyzer rules and a logical plan that doesn't contain 
`MergeIntoProcessor`. We will need a logical plan that the normal analysis 
rules can run on. Then we can hopefully remove the resolution from the rewrite 
as well.
   
   > We will need more tests
   
   Definitely.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to