ahmedabu98 commented on code in PR #34264: URL: https://github.com/apache/beam/pull/34264#discussion_r2005817726
########## sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/AppendFilesToTables.java: ########## @@ -211,5 +213,31 @@ private ManifestWriter<DataFile> createManifestWriter( tableLocation, manifestFilePrefix, uuid, spec.specId())); return ManifestFiles.write(spec, io.newOutputFile(location)); } + + // If bundle fails following a successful commit and gets retried, it may attempt to re-commit + // the same data. + // To mitigate, we check the files in this bundle and remove anything that was already + // committed in the last successful snapshot. + // + // TODO(ahmedabu98): This does not cover concurrent writes from other pipelines, where the + // "last successful snapshot" might reflect commits from other sources. Ideally, we would make + // this stateful, but that is update incompatible. + // TODO(ahmedabu98): add load test pipelines with intentional periodic crashing + private Iterable<FileWriteResult> removeAlreadyCommittedFiles( + Table table, Iterable<FileWriteResult> fileWriteResults) { + if (table.currentSnapshot() == null) { + return fileWriteResults; + } + + List<String> committedFiles = + Streams.stream(table.currentSnapshot().addedDataFiles(table.io())) Review Comment: > it seems unlikely that this would fall over I think so too, but updated to manually iterate over the files for good practice anyways -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org