[
https://issues.apache.org/jira/browse/GOBBLIN-1824?focusedWorklogId=859438&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-859438
]
ASF GitHub Bot logged work on GOBBLIN-1824:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 27/Apr/23 16:51
Start Date: 27/Apr/23 16:51
Worklog Time Spent: 10m
Work Description: ZihanLi58 commented on code in PR #3686:
URL: https://github.com/apache/gobblin/pull/3686#discussion_r1179450955
##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -78,37 +81,43 @@ public Iterator<FileSet<CopyEntity>>
getFileSetIterator(FileSystem targetFs, Cop
+ "%s, you can specify multi locations split by '',",
manifestPath.toString(), fs.getUri().toString(),
ManifestBasedDatasetFinder.MANIFEST_LOCATION));
}
CopyManifest.CopyableUnitIterator manifests = null;
- List<CopyEntity> copyEntities = Lists.newArrayList();
- List<FileStatus> toDelete = Lists.newArrayList();
+ List<CopyEntity> copyEntities = Collections.synchronizedList(Lists.newArrayList());
+ List<FileStatus> toDelete = Collections.synchronizedList(Lists.newArrayList());
//todo: put permission preserve logic here?
try {
+ long startTime = System.currentTimeMillis();
manifests = CopyManifest.getReadIterator(this.fs, this.manifestPath);
+ Cache<String, OwnerAndPermission> permissionMap = CacheBuilder.newBuilder().expireAfterAccess(30, TimeUnit.SECONDS).build();
Review Comment:
The TTL here is expire-after-access: an entry is evicted only if we haven't
tried to get it within the last 30 seconds. I tested with 30k files, and
planning took around 30 seconds. Also, given the assumption that files sharing
the same parent usually sit near each other in the manifest file, I think 30
seconds should be enough for us. The TTL also bounds the memory we use here.
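To make the expire-after-access behaviour concrete, here is a minimal sketch
using only the JDK. It is not Guava's implementation and not the code in this
PR; the class ExpireAfterAccessCache, the injected clock, and the sample path
and permission string are all hypothetical and exist only to show that each
read resets the 30-second timer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical illustration of expire-after-access semantics (not Guava's Cache). */
class ExpireAfterAccessCache<K, V> {
  private static final class Entry<V> {
    final V value;
    volatile long lastAccessMillis;

    Entry(V value, long now) {
      this.value = value;
      this.lastAccessMillis = now;
    }
  }

  private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock; // injected so the example needs no sleeping

  ExpireAfterAccessCache(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Returns the cached value, reloading it if it was idle for ttlMillis or longer. */
  V get(K key, Function<? super K, ? extends V> loader) {
    long now = clock.getAsLong();
    Entry<V> entry = entries.compute(key, (k, old) -> {
      if (old == null || now - old.lastAccessMillis >= ttlMillis) {
        return new Entry<>(loader.apply(k), now); // miss or expired: load again
      }
      old.lastAccessMillis = now; // hit: the access resets the idle timer
      return old;
    });
    return entry.value;
  }

  /** Drops every entry that has been idle for ttlMillis or longer. */
  void cleanUp() {
    long now = clock.getAsLong();
    entries.entrySet().removeIf(e -> now - e.getValue().lastAccessMillis >= ttlMillis);
  }

  int size() {
    return entries.size();
  }
}

class ExpireAfterAccessDemo {
  public static void main(String[] args) {
    long[] now = {0};  // fake clock, in milliseconds
    int[] loads = {0}; // counts how often the (expensive) lookup runs
    ExpireAfterAccessCache<String, String> cache =
        new ExpireAfterAccessCache<>(30_000, () -> now[0]);
    Function<String, String> loader = dir -> {
      loads[0]++;
      return "rwxr-xr-x"; // placeholder for a permission lookup
    };

    cache.get("/data/a", loader);  // t=0s: first lookup, loads
    now[0] = 20_000;
    cache.get("/data/a", loader);  // t=20s: hit, timer resets
    now[0] = 45_000;
    cache.get("/data/a", loader);  // t=45s: only 25s idle, still a hit
    now[0] = 80_000;
    cache.get("/data/a", loader);  // t=80s: 35s idle, loads again
    System.out.println(loads[0]);  // prints 2
  }
}
```

Under the assumption stated in the comment (files with the same parent sit
close together in the manifest), a parent's entry keeps being touched while its
files are processed, so it stays cached for that stretch and is dropped once
the manifest moves on to other directories.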
Issue Time Tracking
-------------------
Worklog Id: (was: 859438)
Time Spent: 1h 10m (was: 1h)
> Improving the Efficiency of Work Planning in Manifest-Based DistCp Jobs
> -----------------------------------------------------------------------
>
> Key: GOBBLIN-1824
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1824
> Project: Apache Gobblin
> Issue Type: Improvement
> Reporter: Zihan Li
> Priority: Major
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> Optimizing Permission Calculation and Introducing Multithreading in
> Manifest-Based DistCp Work Planning