phet commented on code in PR #3686:
URL: https://github.com/apache/gobblin/pull/3686#discussion_r1178773725
##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -117,6 +126,7 @@ public Iterator<FileSet<CopyEntity>>
getFileSetIterator(FileSystem targetFs, Cop
CommitStep step = new DeleteFileCommitStep(targetFs, toDelete,
this.properties, Optional.<Path>absent());
copyEntities.add(new PrePublishStep(datasetURN(), Maps.newHashMap(),
step, 1));
}
+ log.info(String.format("Workunits calculation take %s milliseconds to
process %s files", System.currentTimeMillis() - startTime, numFiles));
Review Comment:
nit: "take" => "took"
BTW, I really like this timing! do we want a metric in addition to a logged
line?
##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -78,37 +81,43 @@ public Iterator<FileSet<CopyEntity>>
getFileSetIterator(FileSystem targetFs, Cop
+ "%s, you can specify multi locations split by '',",
manifestPath.toString(), fs.getUri().toString(),
ManifestBasedDatasetFinder.MANIFEST_LOCATION));
}
CopyManifest.CopyableUnitIterator manifests = null;
- List<CopyEntity> copyEntities = Lists.newArrayList();
- List<FileStatus> toDelete = Lists.newArrayList();
+ List<CopyEntity> copyEntities =
Collections.synchronizedList(Lists.newArrayList());
+ List<FileStatus> toDelete =
Collections.synchronizedList(Lists.newArrayList());
Review Comment:
do these need to be synchronized?
##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/CopyableFile.java:
##########
@@ -375,6 +379,35 @@ public static List<OwnerAndPermission>
resolveReplicatedOwnerAndPermissionsRecur
return ownerAndPermissions;
}
+ /**
+ * Compute the correct {@link OwnerAndPermission} obtained from replicating
source owner and permissions and applying
+ * the {@link PreserveAttributes} rules for fromPath and every ancestor up
to but excluding toPath.
+ * Use permissionMap as a cache to reduce the call to hdfs
+ *
+ * @return A list of the computed {@link OwnerAndPermission}s starting from
fromPath, up to but excluding toPath.
+ * @throws IOException if toPath is not an ancestor of fromPath.
+ */
+ public static List<OwnerAndPermission>
resolveReplicatedOwnerAndPermissionsRecursivelyWithCache(FileSystem sourceFs,
Path fromPath,
+ Path toPath, CopyConfiguration copyConfiguration, Cache<String,
OwnerAndPermission> permissionMap)
Review Comment:
why would a caller use this one rather than
`resolveReplicatedAncestorOwnerAndPermissionsRecursively` - do we need both?
they're not exactly identical, but there seems material overlap. and, why
return a `Map` there but a `List` here?
##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -78,37 +81,43 @@ public Iterator<FileSet<CopyEntity>>
getFileSetIterator(FileSystem targetFs, Cop
+ "%s, you can specify multi locations split by '',",
manifestPath.toString(), fs.getUri().toString(),
ManifestBasedDatasetFinder.MANIFEST_LOCATION));
}
CopyManifest.CopyableUnitIterator manifests = null;
- List<CopyEntity> copyEntities = Lists.newArrayList();
- List<FileStatus> toDelete = Lists.newArrayList();
+ List<CopyEntity> copyEntities =
Collections.synchronizedList(Lists.newArrayList());
+ List<FileStatus> toDelete =
Collections.synchronizedList(Lists.newArrayList());
//todo: put permission preserve logic here?
try {
+ long startTime = System.currentTimeMillis();
manifests = CopyManifest.getReadIterator(this.fs, this.manifestPath);
+ Cache<String, OwnerAndPermission> permissionMap =
CacheBuilder.newBuilder().expireAfterAccess(30, TimeUnit.SECONDS).build();
Review Comment:
how did you arrive at 30s TTL? seems short... I'd think more in the
neighborhood of 5-15 mins
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]