[GitHub] [iceberg] rdblue commented on pull request #2021: Add the cardinality check to detect ambiguous target row for MERGE INTO

GitBox Mon, 04 Jan 2021 10:48:26 -0800


rdblue commented on pull request #2021:
URL: https://github.com/apache/iceberg/pull/2021#issuecomment-754148565



   Sorry I didn't look into this more thoroughly before now, but I think that 
there is a better way to implement the cardinality check that doesn't require 
an inner join.
   
   The (2003) SQL spec in Part 2, section 14.9 says:
   
   > Let Q be the result of evaluating [the USING clause]
   > If <merge when matched clause> is specified, then:
   > For each row R1 of T:
   > ...
   > Let M be the number of matching rows in Q for R1.
   > If M is greater than 1 (one), then an exception condition is raised: 
cardinality violation.
   
   I think that means the cardinality check is done for each row of the target 
table. If two rows from the target table have the same "key" from the `ON` 
condition, that's okay and they are processed separately. The cardinality 
violation applies only when more than one row from the result of the `USING` 
clause matches a single target row.
   
   That aligns with [a comment from Eugene 
Koifman](https://community.cloudera.com/t5/Support-Questions/Hive-Merge-command-throwing-error-message/td-p/228500)
 (from the Hive implementation):
   
   > Logically what this means is that the query is asking the system to update 
1 existing row in target in 2 (or more) different ways.
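
To make that reading concrete, here is a brute-force sketch of the spec's rule. It assumes an equality `ON` condition on a single integer key. The class and method names are made up for illustration, and this is not how the check would be implemented in Spark.

```java
import java.util.List;

// Hypothetical illustration of the SQL:2003 rule quoted above; not Iceberg code.
final class MergeCardinalityExample {
  // M in the spec: the number of rows in Q (the USING result) matching one target row.
  static int matchCount(int targetKey, List<Integer> sourceKeys) {
    int count = 0;
    for (int sourceKey : sourceKeys) {
      if (sourceKey == targetKey) {
        count++;
      }
    }
    return count;
  }

  // A violation exists if any single target row has more than one match in Q.
  // Duplicate keys among the target rows alone never trigger it.
  static boolean violates(List<Integer> targetKeys, List<Integer> sourceKeys) {
    for (int targetKey : targetKeys) {
      if (matchCount(targetKey, sourceKeys) > 1) {
        return true;
      }
    }
    return false;
  }
}
```

With target keys `[1, 1, 2]` and source keys `[1, 3]`, `violates` returns `false`: each of the two target rows with key 1 matches exactly one source row. With target keys `[1, 2]` and source keys `[1, 1]`, it returns `true`, because the single target row with key 1 would be updated in two different ways.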
   
   As a result, I think that there is an easier way to implement this check. 
Instead of running an additional inner join, the check should run as the first 
thing in 
[`processRow`](https://github.com/apache/iceberg/pull/2022/files#diff-deb1276df77b1ac20d203d4607231c593cfeac1a3c3771d174bd04fc1afd773aR92):
   1. Add `monotonically_increasing_id` to the target table rows
   2. In `processRow`, keep track of the last row's ID
   3. If the last row's ID matches the current row's ID, throw an exception for 
the cardinality check
   4. If the current row's ID is different, then replace the last row ID with it
   
   That works because the join will always produce the matching rows together 
as a group.
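
A minimal sketch of steps 2 through 4, assuming each joined row carries the ID that was attached to its target row before the join. `CardinalityCheck` and its `check` method are hypothetical names, not part of Iceberg or of the `processRow` code in the linked PR. The sketch relies on the property stated above, that all matches for one target row arrive consecutively.

```java
// Hypothetical helper, not part of Iceberg: remembers the last target row ID
// seen by a per-row merge routine such as processRow.
final class CardinalityCheck {
  private boolean seenRow = false;  // false until the first target row arrives
  private long lastTargetRowId;     // ID of the most recently processed target row

  // Call first for every joined row that has a matching target row.
  // targetRowId is the ID attached to the target row before the join.
  void check(long targetRowId) {
    if (seenRow && targetRowId == lastTargetRowId) {
      // Same target row twice in a row: it matched more than one source row.
      throw new IllegalStateException(
          "Cardinality violation: target row " + targetRowId
              + " matched more than one source row");
    }
    // Different target row: remember it for the next comparison.
    seenRow = true;
    lastTargetRowId = targetRowId;
  }
}
```

Because the state is a single ID, one instance would be needed per row iterator (for example, one per task), and the check adds constant memory instead of a second join.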




