rdblue commented on code in PR #6965:
URL: https://github.com/apache/iceberg/pull/6965#discussion_r1127084322
##########
spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestMerge.java:
##########
@@ -214,11 +220,14 @@ public void
testMergeIntoEmptyTargetInsertAllNonMatchingRows() {
row(3, "emp-id-3") // new
);
assertEquals(
- "Should have expected rows", expectedRows, sql("SELECT * FROM %s ORDER
BY id", tableName));
+ "Should have expected rows",
+ expectedRows,
+ sql("SELECT * FROM %s ORDER BY id", selectTarget()));
}
@Test
public void testMergeIntoEmptyTargetInsertOnlyMatchingRows() {
+ Assume.assumeTrue("Branch can only be used for non-empty table", branch ==
null);
Review Comment:
> I also agree it's strange to create a new branch for operations like
UPDATE and DELETE. However, if I think from the perspective of branching being
an improved version of WAP, then it feels less strange because we are just
saying the result of a write is staged in the specific branch until published.
I think we do want to use this for an improved version of WAP. But do we
necessarily need to have implicit branch creation everywhere? I don't think
that's required, and if we decide to add it we can always do it later.
> The difference is that, in the WAP case, it won't explicitly say DELETE
FROM table.branch_xxx and the branch is supposed to be hidden as an environment
setting
That's how WAP currently works, but we don't necessarily need branches to
behave that way. We could create a `spark.wap.branch` setting that adds
implicit branching and also ensures that all reads use that branch if it
exists. That's well-defined behavior for an environment setting.
I'm more skeptical of implicit branch creation when performing normal branch
writes using the `table.branch_x` naming. If you're specifically writing to a
branch by name then it seems like it should exist before that point. If we
require it to exist, then we avoid needing a behavior decision about how to
initialize the state of an implicitly created branch.
> So if user does SELECT * from table.branch_not_exist it will just read the
current table instead of saying branch not found.
I see your logic and I think that this follows, but I also don't think it is
a good idea to define the state of a non-existent branch to be the current
table state. Again, I think we could introduce an environment state that makes
this the behavior specifically for WAP, but that shouldn't be the default for
all branches and doesn't need to be added right now.
The most natural consequence of reading a branch that doesn't exist is to
get an error. Otherwise, we get confusing behavior. For example, if you tell me
you'll create a branch for me to look at in a table but I look at it before you
prepare the data, I still get results. It also opens up problems when a name is
slightly off; say I want to read _tag_ `q4_22_audit` but I actually read
`q4_2022_audit` and get the current table state because the branch/tag doesn't
exist. That would be really confusing for users.
I'm all for having ways to make WAP better using branches, but when we
follow this line of reasoning it leads to confusing behavior in the more common
cases. So I think we should make branches work in simple cases and then add
more complex features on top.
I don't think this severely limits the utility of branches, since we can go
through and add WAP on top as a next step.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]