rdblue commented on PR #6651:
URL: https://github.com/apache/iceberg/pull/6651#issuecomment-1433928069

   > 1.) We won't really be able to write tests for `SparkPositionDeltaWrite` until the SQL conf change is in, because this PR focuses only on DataFrame write options
   
   I think the way to solve this is to add back the Spark SQL session write config. The issue I raised in [my comment](https://github.com/apache/iceberg/pull/6651#discussion_r1101931339) applies only to the read option, so if we remove the read session config we avoid it. The write config doesn't have the same issue (writes automatically handle the case where the branch doesn't exist yet), and adding it back would let us run the tests.
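   
   As a rough sketch of what such a test setup could look like once a session-level write config is restored (the conf key below is assumed for illustration, not the final property name):
   
   ```java
   import org.apache.spark.sql.SparkSession;
   
   public class BranchWriteConfigSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("branch-write-config-sketch")
           .getOrCreate();
   
       // Assumed conf key, for illustration only; the restored session write
       // config would define its own property name.
       spark.conf().set("spark.sql.iceberg.write-branch", "audit");
   
       // With a session-level config, row-level commands that go through
       // SparkPositionDeltaWrite (DELETE/UPDATE/MERGE) can be exercised
       // without passing a per-write DataFrame option.
       spark.sql("DELETE FROM prod.db.table WHERE id < 0");
     }
   }
   ```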
   
   > 2.) Behavior of `createOrReplace` when the branch write option is passed in. Say, for example, a user does the following:
   > 
   > ```
   > data.writeTo("prod.db.table")
   >     .tableProperty("write.format.default", "orc")
   >     .partitionedBy($"level", days($"ts"))
   >     .writeOption(SparkWriteOptions.BRANCH, "branch")
   >     .createOrReplace()
   > ```
   > 
   > What should the desired behavior be? Currently, the write is performed on the branch, and if it's a replace, the main table state stays the same, so the `replace` loses its meaning. Another possible behavior is that the branch is written to and the replace then performs a branch replacement, replacing the head of main with the head of the target branch. That is what makes sense to me, but I think it will be hard to achieve with the current catalog/planner abstractions.
   
   I think the simplest answer is to disallow `branch` for staged Spark tables. The branch is passed through `LogicalWriteInfo`, so we can disable this by overriding `newWriteBuilder` in `StagedSparkTable`, which is used for atomic create/replace:
   
   ```java
     // StagedSparkTable is used for atomic create/replace, so rejecting the
     // branch option here disables branch writes for staged tables.
     @Override
     public WriteBuilder newWriteBuilder(LogicalWriteInfo info) {
       String branch = info.options().get(SparkWriteOptions.BRANCH);
       if (branch != null && !branch.equalsIgnoreCase(SnapshotRef.MAIN_BRANCH)) {
         throw new UnsupportedOperationException(
             String.format("Cannot create or replace table with branch: %s", branch));
       }
   
       return super.newWriteBuilder(info);
     }
   ```
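   
   With this override, the `createOrReplace()` example above would fail fast with a clear error instead of writing to the branch and silently leaving the main table state unchanged.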
   
   Eventually, we may want to handle this case and decide on the behavior. The `create`, `replace`, and `createOrReplace` methods in `DataFrameWriterV2` completely replace the table state, including the schema and partitioning. So the correct behavior for a branch is to add the new schema, partition spec, and properties, then create a new snapshot and set the branch state to that snapshot. This would be similar to the behavior of `TableMetadata.buildReplacement`, but slightly different because it would not need to set the table's current schema and partition spec, just the ones for the branch. We should definitely not handle this right away; for now we should just disable it.
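   
   For reference, a rough sketch of that eventual flow (the metadata steps and the `writeNewSnapshot` helper below are hypothetical, not existing API; only `ManageSnapshots.replaceBranch` is real):
   
   ```java
   import java.util.Map;
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.Snapshot;
   import org.apache.iceberg.Table;
   
   public class BranchReplaceSketch {
   
     static void replaceBranchState(
         Table table, String branch, Schema newSchema, PartitionSpec newSpec,
         Map<String, String> properties) {
       // 1. Add the new schema and spec to the table metadata without making
       //    them the table's current schema or default spec (this is where it
       //    differs from TableMetadata.buildReplacement).
       // 2. Write the replacement data as a new snapshot against them.
       Snapshot replacement = writeNewSnapshot(table, newSchema, newSpec, properties);
   
       // 3. Atomically point the branch at the new snapshot.
       table.manageSnapshots()
           .replaceBranch(branch, replacement.snapshotId())
           .commit();
     }
   
     // Hypothetical helper standing in for steps 1 and 2 above.
     private static Snapshot writeNewSnapshot(
         Table table, Schema schema, PartitionSpec spec, Map<String, String> props) {
       throw new UnsupportedOperationException("sketch only");
     }
   }
   ```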

