pabloem commented on a change in pull request #13558:
URL: https://github.com/apache/beam/pull/13558#discussion_r577810302
##########
File path:
sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java
##########
@@ -764,7 +764,11 @@ final void moveToOutputFiles(
}
// During a failure case, files may have been deleted in an earlier step. Thus
// we ignore missing files here.
- FileSystems.rename(srcFiles, dstFiles, StandardMoveOptions.IGNORE_MISSING_FILES);
+ FileSystems.rename(
+ srcFiles,
+ dstFiles,
+ StandardMoveOptions.IGNORE_MISSING_FILES,
+ StandardMoveOptions.SKIP_IF_DESTINATION_EXISTS);
Review comment:
The behavior we verified on GCS is that we will not encounter 'incomplete'
files. In fact, we only consider a file 'incomplete' if its checksum (or its
size, in the absence of a checksum) differs from that of the source file. The
same rationale applies to other file systems: encountering a destination file
with an equal checksum means we already have the exact same file, so we don't
need to rewrite it.
The only filesystem where this is a tough assumption is HadoopFileSystem, where
we don't have a hash function and instead rely solely on the size.
Thoughts?
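To make the rule above concrete, here is a minimal sketch of the comparison I have in mind. It is not the actual Beam code: `DestinationCheck`, `FileMeta`, and `looksIdentical` are hypothetical names made up for illustration, and the checksum is modeled as an optional string.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the Beam SDK.
final class DestinationCheck {

  /** Hypothetical stand-in for per-file metadata (size plus optional checksum). */
  static final class FileMeta {
    final long sizeBytes;
    final Optional<String> checksum;

    FileMeta(long sizeBytes, String checksumOrNull) {
      this.sizeBytes = sizeBytes;
      this.checksum = Optional.ofNullable(checksumOrNull);
    }
  }

  /**
   * Returns true when the destination can be treated as a complete copy of the
   * source: checksums are compared when both sides have one, otherwise the
   * comparison falls back to file size alone.
   */
  static boolean looksIdentical(FileMeta src, FileMeta dst) {
    if (src.checksum.isPresent() && dst.checksum.isPresent()) {
      return src.checksum.get().equals(dst.checksum.get());
    }
    // Size-only fallback: weaker, since different contents can share a size.
    return src.sizeBytes == dst.sizeBytes;
  }

  private DestinationCheck() {}
}
```

The size-only branch is where the HadoopFileSystem concern shows up: two files of the same length with different contents would be reported as identical, and the rename would be skipped.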