[ 
https://issues.apache.org/jira/browse/GOBBLIN-1824?focusedWorklogId=859438&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-859438
 ]

ASF GitHub Bot logged work on GOBBLIN-1824:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Apr/23 16:51
            Start Date: 27/Apr/23 16:51
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 commented on code in PR #3686:
URL: https://github.com/apache/gobblin/pull/3686#discussion_r1179450955


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/ManifestBasedDataset.java:
##########
@@ -78,37 +81,43 @@ public Iterator<FileSet<CopyEntity>> getFileSetIterator(FileSystem targetFs, Cop
           + "%s, you can specify multi locations split by '',", manifestPath.toString(), fs.getUri().toString(), ManifestBasedDatasetFinder.MANIFEST_LOCATION));
     }
     CopyManifest.CopyableUnitIterator manifests = null;
-    List<CopyEntity> copyEntities = Lists.newArrayList();
-    List<FileStatus> toDelete = Lists.newArrayList();
+    List<CopyEntity> copyEntities = Collections.synchronizedList(Lists.newArrayList());
+    List<FileStatus> toDelete = Collections.synchronizedList(Lists.newArrayList());
     //todo: put permission preserve logic here?
     try {
+      long startTime = System.currentTimeMillis();
       manifests = CopyManifest.getReadIterator(this.fs, this.manifestPath);
+      Cache<String, OwnerAndPermission> permissionMap = CacheBuilder.newBuilder().expireAfterAccess(30, TimeUnit.SECONDS).build();

Review Comment:
   The TTL here is expire-after-access: an entry expires only if we haven't 
read it for 30 seconds. I tested with 30k files, and planning takes around 30 
seconds. Given the assumption that files sharing the same parent usually sit 
near each other in the manifest file, I think 30 seconds should be enough for 
us. It also bounds the memory we use here.
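
To make the expire-after-access behaviour concrete, here is a minimal JDK-only sketch. It is not Guava's implementation and not the code in this PR: the class `ExpireAfterAccessCache`, its `loadCount()` and `size()` helpers, and the injected clock are all hypothetical names used only for illustration. It shows that each read refreshes an entry's timer, so a parent directory that keeps being looked up stays cached, while entries left idle for longer than the TTL are dropped.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical illustration of expire-after-access semantics (not Guava's code). */
final class ExpireAfterAccessCache<K, V> {

  /** A cached value plus the time it was last read or written. */
  private static final class Entry<V> {
    final V value;
    long lastAccessMillis;

    Entry(V value, long now) {
      this.value = value;
      this.lastAccessMillis = now;
    }
  }

  private final Map<K, Entry<V>> entries = new HashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock; // injected so the behaviour can be tested without sleeping
  private int loads = 0;

  ExpireAfterAccessCache(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Returns the cached value, computing it with the loader on a miss. */
  synchronized V get(K key, Function<K, V> loader) {
    long now = clock.getAsLong();
    // Drop every entry that has been idle for at least the TTL; this is what bounds memory.
    entries.values().removeIf(e -> now - e.lastAccessMillis >= ttlMillis);
    Entry<V> entry = entries.get(key);
    if (entry == null) {
      entry = new Entry<>(loader.apply(key), now);
      entries.put(key, entry);
      loads++;
    } else {
      entry.lastAccessMillis = now; // a read resets the idle timer
    }
    return entry.value;
  }

  /** Number of times the loader was invoked. */
  synchronized int loadCount() {
    return loads;
  }

  /** Number of entries currently held. */
  synchronized int size() {
    return entries.size();
  }
}
```

With a 30-second TTL, reading the same parent at 0 s, 20 s and 45 s triggers a single load, because no gap between accesses reaches 30 seconds. If the manifest then moves on to a different parent 35 seconds later, the old entry is dropped and only the new one is held.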





Issue Time Tracking
-------------------

    Worklog Id:     (was: 859438)
    Time Spent: 1h 10m  (was: 1h)

> Improving the Efficiency of Work Planning in Manifest-Based DistCp Jobs
> -----------------------------------------------------------------------
>
>                 Key: GOBBLIN-1824
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1824
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: Zihan Li
>            Priority: Major
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Optimizing Permission Calculation and Introducing Multithreading in 
> Manifest-Based DistCp Work Planning



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
