rdblue commented on pull request #2021: URL: https://github.com/apache/iceberg/pull/2021#issuecomment-754148565
Sorry I didn't look into this more thoroughly before now, but I think there is a better way to implement the cardinality check that doesn't require an inner join.

The SQL spec (2003), Part 2, section 14.9 says:

> Let Q be the result of evaluating [the USING clause]
> If <merge when matched clause> is specified, then:
> For each row R1 of T:
> ...
> Let M be the number of matching rows in Q for R1.
> If M is greater than 1 (one), then an exception condition is raised: cardinality violation.

I think that means the cardinality check is done for each row of the target table. If two rows from the target table have the same "key" from the `ON` condition, that's okay and they are processed separately. The cardinality violation applies when two or more rows from the result of the `USING` clause match a single target row. That aligns with [a comment from Eugene Koifman](https://community.cloudera.com/t5/Support-Questions/Hive-Merge-command-throwing-error-message/td-p/228500) (from the Hive implementation):

> Logically what this means is that the query is asking the system to update 1 existing row in target in 2 (or more) different ways.

As a result, I think there is an easier way to implement this check. Instead of running an additional inner join, the check should run as the first thing in [`processRow`](https://github.com/apache/iceberg/pull/2022/files#diff-deb1276df77b1ac20d203d4607231c593cfeac1a3c3771d174bd04fc1afd773aR92):

1. Add `monotonically_increasing_id` to the target table rows
2. In `processRow`, keep track of the last row's ID
3. If the last row's ID matches the current row's ID, throw an exception for the cardinality check
4. If the current row's ID is different, then replace the last row ID with it

That works because the join will always produce the matching rows together as a group.
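To illustrate steps 2 through 4, here is a minimal sketch in plain Python rather than the actual `processRow` code. The names `check_cardinality` and `CardinalityViolation` and the `(target_row_id, source_value)` row layout are hypothetical, chosen only for this example. It assumes that joined rows for the same target row arrive next to each other, and that a row ID of `None` marks a source row with no matching target row.

```python
from typing import Iterable, Iterator, Optional, Tuple


class CardinalityViolation(Exception):
    """Raised when one target row matches more than one source row."""


# Hypothetical layout: (target_row_id, source_value).
# target_row_id is None when the source row matched no target row.
JoinedRow = Tuple[Optional[int], str]


def check_cardinality(joined_rows: Iterable[JoinedRow]) -> Iterator[JoinedRow]:
    """Pass rows through, failing if a target row ID repeats back to back.

    Assumes the join emits all rows for one target row as a contiguous group.
    """
    last_id: Optional[int] = None
    for row in joined_rows:
        row_id = row[0]
        if row_id is not None:
            if row_id == last_id:
                # Same target row seen twice in a row: two source matches.
                raise CardinalityViolation(
                    f"target row {row_id} matched more than one source row"
                )
            # Different target row: remember it for the next comparison.
            last_id = row_id
        yield row
```

Only the previous ID is kept, so the check needs constant memory per task. It would miss duplicates if the rows for one target row were not contiguous, which is why the grouping assumption matters.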
