pabloem commented on a change in pull request #13558:
URL: https://github.com/apache/beam/pull/13558#discussion_r577368164



##########
File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
##########
@@ -401,16 +412,40 @@ public ResourceId apply(@Nonnull Metadata input) {
     List<ResourceId> srcToHandle = new ArrayList<>();
     List<ResourceId> destToHandle = new ArrayList<>();
 
-    List<MatchResult> matchResults = matchResources(srcResourceIds);
-    for (int i = 0; i < matchResults.size(); ++i) {
-      if (!matchResults.get(i).status().equals(Status.NOT_FOUND)) {
-        srcToHandle.add(srcResourceIds.get(i));
-        destToHandle.add(destResourceIds.get(i));
+    List<MatchResult> matchSrcResults = matchResources(srcResourceIds);
+    List<MatchResult> matchDestResults = new ArrayList<>();
+    if (skipExistingDest) {
+      matchDestResults = matchResources(destResourceIds);
+    }
+
+    for (int i = 0; i < matchSrcResults.size(); ++i) {
+      if (matchSrcResults.get(i).status().equals(Status.NOT_FOUND) && ignoreMissingSrc) {
+        // If the source is not found, and we are ignoring found source files, then we skip it.
+        continue;
       }
+      if (skipExistingDest
+          && matchDestResults.get(i).status().equals(Status.OK)
+          && filesMatch(
+              matchDestResults.get(i).metadata().get(0),
+              matchSrcResults.get(i).metadata().get(0))) {
+        // If the destination exists, and we are skipping when destinations exist, then we skip.
+        continue;
+      }
+      srcToHandle.add(srcResourceIds.get(i));
+      destToHandle.add(destResourceIds.get(i));
     }
     return KV.of(srcToHandle, destToHandle);
   }
 
+  private static boolean filesMatch(MatchResult.Metadata first, MatchResult.Metadata second) {
+    if (!first.checksum().isPresent() && !second.checksum().isPresent()) {

Review comment:
Changed this to a null check. Only if both checksums are null should we rely
on the file size; otherwise we should always rely on the checksum. (If only one
file reports a checksum and the other doesn't, then they are not equal, which
is what happens in the next section.)



