[GitHub] [iceberg] rdblue commented on pull request #1947: [WIP] Spark MERGE INTO Support (copy-on-write implementation)

GitBox Thu, 17 Dec 2020 18:04:43 -0800


rdblue commented on pull request #1947:
URL: https://github.com/apache/iceberg/pull/1947#issuecomment-747819953

I agree with all of the points that @aokolnychyi brought up. I also have a
few suggestions on how to do this more cleanly.

> We should think about how to group data before writing.

Please take a look at #1955. That exposes `_pos` so that we can use it. I
would suggest the following:
* If the table has a sort order, add a global sort of all the rows produced
by the merge node.
* If the table does not have a sort order, then add a default sort by
`_file`, `_pos`, partition columns, and the `MERGE` condition's references.
That way, existing rows that have non-null `_file` and `_pos` will mostly
preserve the sort order in the source data files (except for changed columns).
Inserted rows will have nulls for `_file` and `_pos` and will then be sorted by
partition columns to minimize the number of output files, and then by columns
from the merge condition because those are likely to be high cardinality
(giving Spark the ability to split partitions).

> We should align the assignments according to the target columns. This also
applies to UPDATE.

Agreed. I think that we should have a rule similar to the logic in
[`TableOutputResolver`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala#L30).
One branch would modify `UPDATE` and add expressions to pull unchanged column
values from the existing row. Another branch would modify `INSERT` by
rearranging the columns by name for `INSERT (_names_) VALUES (_vals_)`.

And we should also have a `MergeOutputCheck` rule to assert that each
`INSERT` or `UPDATE` action is aligned with the output of the merge node.

Modifying and checking the logical plan in the analyzer like this will
require analyzer rules and a logical plan that doesn't contain
`MergeIntoProcessor`. We will need a logical plan that the normal analysis
rules can run on. Then we can hopefully remove the resolution from the rewrite
as well.

> We will need more tests

Definitely.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on pull request #1947: [WIP] Spark MERGE INTO Support (copy-on-write implementation)

Reply via email to