[GitHub] [iceberg] rdblue commented on a diff in pull request #6965: Spark 3.3: Support write to branch through table identifier

via GitHub Mon, 06 Mar 2023 14:04:59 -0800


rdblue commented on code in PR #6965:
URL: https://github.com/apache/iceberg/pull/6965#discussion_r1127084322



##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java:
##########
@@ -214,11 +220,14 @@ public void 
testMergeIntoEmptyTargetInsertAllNonMatchingRows() {
             row(3, "emp-id-3") // new
             );
     assertEquals(
-        "Should have expected rows", expectedRows, sql("SELECT * FROM %s ORDER 
BY id", tableName));
+        "Should have expected rows",
+        expectedRows,
+        sql("SELECT * FROM %s ORDER BY id", selectTarget()));
   }
 
   @Test
   public void testMergeIntoEmptyTargetInsertOnlyMatchingRows() {
+    Assume.assumeTrue("Branch can only be used for non-empty table", branch == 
null);

Review Comment:
   > I also agree it's strange to create a new branch for operations like 
UPDATE and DELETE. However, if I think from the perspective of branching being 
an improved version of WAP, then it feels less strange because we are just 
saying the result of a write is staged in the specific branch until published.
   
   I think we do want to use this for an improved version of WAP. But do we 
necessarily need to have implicit branch creation everywhere? I don't think 
that's required, and if we decide to add it we can always do it later.
   
   > The difference is that, in the WAP case, it won't explicitly say DELETE 
FROM table.branch_xxx and the branch is supposed to be hidden as an environment 
setting
   
   That's how WAP currently works, but we don't necessarily need branches to 
behave that way. We could create a `spark.wap.branch` setting that adds 
implicit branching and also ensures that all reads use that branch if it 
exists. That's well-defined behavior for an environment setting.
   
   I'm more skeptical of implicit branch creation when performing normal branch 
writes using the `table.branch_x` naming. If you're specifically writing to a 
branch by name then it seems like it should exist before that point. If we 
require it to exist, then we avoid needing a behavior decision about how to 
initialize the state of an implicitly created branch.
   
   > So if user does SELECT * from table.branch_not_exist it will just read the 
current table instead of saying branch not found.
   
   I see your logic and I think that this follows, but I also don't think it is 
a good idea to define the state of a non-existent branch to be the current 
table state. Again, I think we could introduce an environment state that makes 
this the behavior specifically for WAP, but that shouldn't be the default for 
all branches and doesn't need to be added right now.
   
   The most natural consequence of reading a branch that doesn't exist is to 
get an error. Otherwise, we get confusing behavior. For example, if you tell me 
you'll create a branch for me to look at in a table but I look at it before you 
prepare the data, I still get results. It also opens up problems when a name is 
slightly off; say I want to read _tag_ `q4_22_audit` but I actually read 
`q4_2022_audit` and get the current table state because the branch/tag doesn't 
exist. That would be really confusing for users.
   
   I'm all for having ways to make WAP better using branches, but when we 
follow this line of reasoning it leads to confusing behavior in the more common 
cases. So I think we should make branches work in simple cases and then add 
more complex features on top.
   
   I don't think this severely limits the utility of branches, since we can go 
through and add WAP on top as a next step.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] rdblue commented on a diff in pull request #6965: Spark 3.3: Support write to branch through table identifier

Reply via email to