rdblue commented on a change in pull request #1348:
URL: https://github.com/apache/iceberg/pull/1348#discussion_r482592010
##########
File path:
flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java
##########
@@ -164,16 +168,51 @@ private void commitUpToCheckpoint(long checkpointId) {
pendingDataFiles.addAll(dataFiles);
}
- AppendFiles appendFiles = table.newAppend();
- pendingDataFiles.forEach(appendFiles::appendFile);
- appendFiles.set(MAX_COMMITTED_CHECKPOINT_ID, Long.toString(checkpointId));
- appendFiles.set(FLINK_JOB_ID, flinkJobId);
- appendFiles.commit();
+ if (replacePartitions) {
+ replacePartitions(pendingDataFiles, checkpointId);
+ } else {
+ append(pendingDataFiles, checkpointId);
+ }
// Clear the committed data files from dataFilesPerCheckpoint.
pendingFileMap.clear();
}
+ private void replacePartitions(List<DataFile> dataFiles, long checkpointId) {
+ ReplacePartitions dynamicOverwrite = table.newReplacePartitions();
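The hunk is truncated at the line under review. A minimal sketch of how the new helper plausibly continues, mirroring the removed append path above (an assumption, not the PR's actual code; `table`, `flinkJobId`, and the summary keys come from the surrounding class):
```java
private void replacePartitions(List<DataFile> dataFiles, long checkpointId) {
  // Dynamic overwrite: every partition touched by the incoming files is
  // replaced wholesale with those files.
  ReplacePartitions dynamicOverwrite = table.newReplacePartitions();
  dataFiles.forEach(dynamicOverwrite::addFile);
  // Record the checkpoint metadata in the snapshot summary, as the
  // removed append path did.
  dynamicOverwrite.set(MAX_COMMITTED_CHECKPOINT_ID, Long.toString(checkpointId));
  dynamicOverwrite.set(FLINK_JOB_ID, flinkJobId);
  dynamicOverwrite.commit();
}
```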
Review comment:
I just want to note that we don't encourage using
`ReplacePartitions` because the data it deletes is implicit. It is better to
state explicitly what data should be overwritten, as in the new Spark API:
```scala
df.writeTo("iceberg.db.table").overwrite($"date" === "2020-09-01")
```
If Flink's overwrite semantics are defined as replacing partitions, then it
should be okay. But I highly recommend being more explicit about what data is
replaced.
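For comparison, the explicit form in Iceberg's Java API goes through `OverwriteFiles` with a row filter, so the deleted data is spelled out in the commit itself. A minimal sketch (the `date` column and value, `table`, and `dataFiles` are illustrative assumptions):
```java
import org.apache.iceberg.OverwriteFiles;
import org.apache.iceberg.expressions.Expressions;

// Delete exactly the rows matching the filter, then add the replacement
// files; nothing is removed implicitly.
OverwriteFiles overwrite = table.newOverwrite()
    .overwriteByRowFilter(Expressions.equal("date", "2020-09-01"));
dataFiles.forEach(overwrite::addFile);
overwrite.commit();
```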