phet commented on code in PR #3686:
URL: https://github.com/apache/gobblin/pull/3686#discussion_r1178773725


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -117,6 +126,7 @@ public Iterator<FileSet<CopyEntity>> 
getFileSetIterator(FileSystem targetFs, Cop
         CommitStep step = new DeleteFileCommitStep(targetFs, toDelete, 
this.properties, Optional.<Path>absent());
         copyEntities.add(new PrePublishStep(datasetURN(), Maps.newHashMap(), 
step, 1));
       }
+      log.info(String.format("Workunits calculation take %s milliseconds to 
process %s files", System.currentTimeMillis() - startTime, numFiles));

Review Comment:
   nit: "take" => "took"
   
   BTW, I really like this timing!  do we want a metric in addition to a logged 
line?



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -78,37 +81,43 @@ public Iterator<FileSet<CopyEntity>> 
getFileSetIterator(FileSystem targetFs, Cop
           + "%s, you can specify multi locations split by '',", 
manifestPath.toString(), fs.getUri().toString(), 
ManifestBasedDatasetFinder.MANIFEST_LOCATION));
     }
     CopyManifest.CopyableUnitIterator manifests = null;
-    List<CopyEntity> copyEntities = Lists.newArrayList();
-    List<FileStatus> toDelete = Lists.newArrayList();
+    List<CopyEntity> copyEntities = 
Collections.synchronizedList(Lists.newArrayList());
+    List<FileStatus> toDelete = 
Collections.synchronizedList(Lists.newArrayList());

Review Comment:
   do these need to be synchronized?



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/CopyableFile.java:
##########
@@ -375,6 +379,35 @@ public static List<OwnerAndPermission> 
resolveReplicatedOwnerAndPermissionsRecur
     return ownerAndPermissions;
   }
 
+  /**
+   * Compute the correct {@link OwnerAndPermission} obtained from replicating 
source owner and permissions and applying
+   * the {@link PreserveAttributes} rules for fromPath and every ancestor up 
to but excluding toPath.
+   * Use permissionMap as a cache to reduce the call to hdfs
+   *
+   * @return A list of the computed {@link OwnerAndPermission}s starting from 
fromPath, up to but excluding toPath.
+   * @throws IOException if toPath is not an ancestor of fromPath.
+   */
+  public static List<OwnerAndPermission> 
resolveReplicatedOwnerAndPermissionsRecursivelyWithCache(FileSystem sourceFs, 
Path fromPath,
+      Path toPath, CopyConfiguration copyConfiguration, Cache<String, 
OwnerAndPermission> permissionMap)

Review Comment:
   why would a caller use this one rather than 
`resolveReplicatedAncestorOwnerAndPermissionsRecursively` - do we need both?  
they're not exactly identical, but there seems material overlap.  and, why 
return a `Map` there but a `List` here?



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -78,37 +81,43 @@ public Iterator<FileSet<CopyEntity>> 
getFileSetIterator(FileSystem targetFs, Cop
           + "%s, you can specify multi locations split by '',", 
manifestPath.toString(), fs.getUri().toString(), 
ManifestBasedDatasetFinder.MANIFEST_LOCATION));
     }
     CopyManifest.CopyableUnitIterator manifests = null;
-    List<CopyEntity> copyEntities = Lists.newArrayList();
-    List<FileStatus> toDelete = Lists.newArrayList();
+    List<CopyEntity> copyEntities = 
Collections.synchronizedList(Lists.newArrayList());
+    List<FileStatus> toDelete = 
Collections.synchronizedList(Lists.newArrayList());
     //todo: put permission preserve logic here?
     try {
+      long startTime = System.currentTimeMillis();
       manifests = CopyManifest.getReadIterator(this.fs, this.manifestPath);
+      Cache<String, OwnerAndPermission> permissionMap = 
CacheBuilder.newBuilder().expireAfterAccess(30, TimeUnit.SECONDS).build();

Review Comment:
   how did you arrive at 30s TTL?  seems short... I'd think more in the 
neighborhood of 5-15 mins



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to