[ 
https://issues.apache.org/jira/browse/GOBBLIN-2135?focusedWorklogId=934573&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-934573
 ]

ASF GitHub Bot logged work on GOBBLIN-2135:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Sep/24 17:19
            Start Date: 12/Sep/24 17:19
    Worklog Time Spent: 10m 
      Work Description: Will-Lo commented on code in PR #4030:
URL: https://github.com/apache/gobblin/pull/4030#discussion_r1757279937


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java:
##########
@@ -590,6 +590,15 @@ ApplicationId setupAndSubmitApplication() throws IOException, YarnException, Int
     amContainerLaunchContext.setEnvironment(YarnHelixUtils.getEnvironmentVariables(this.yarnConfiguration));
     amContainerLaunchContext.setCommands(Lists.newArrayList(buildApplicationMasterCommand(applicationId.toString(), resource.getMemory())));
 
+    if (this.jarCacheEnabled) {
+      Path jarCachePath = YarnHelixUtils.calculateJarCachePath(this.config);
+      // Retain at least the current and last month's jars to handle executions running for ~30 days max
+      boolean cleanedSuccessfully = YarnHelixUtils.retainKLatestJarCachePaths(jarCachePath.getParent(), 2, this.fs);

Review Comment:
   It runs after caching the jars. But the job uses a consistent 
`YARN_APPLICATION_LAUNCHER_START_TIME_KEY`, so no matter how many times we 
compute the cache path, it creates at most one path, and that path is the 
one where the jars are being uploaded.
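
To illustrate the point above, here is a minimal sketch of the idea, not the actual `YarnHelixUtils` implementation. It assumes the cache directory is named after the month of a fixed launcher start time (the `yyyy-MM` naming, the class `JarCacheSketch`, and both method names are hypothetical) and works on plain strings instead of Hadoop `Path` and `FileSystem` objects.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class JarCacheSketch {

  // Hypothetical helper: derive the monthly cache directory from a fixed
  // launcher start time. Because the start time is set once per job,
  // repeated calls always resolve to the same directory.
  public static String monthlyCacheDir(String cacheRoot, long launcherStartTimeMillis) {
    YearMonth month =
        YearMonth.from(Instant.ofEpochMilli(launcherStartTimeMillis).atZone(ZoneOffset.UTC));
    return cacheRoot + "/" + month; // YearMonth.toString() yields yyyy-MM
  }

  // Hypothetical helper: given the monthly directory names under the cache
  // root, return the ones that fall outside the k most recent months.
  public static List<String> pathsToDelete(List<String> monthlyDirs, int k) {
    List<String> sorted = new ArrayList<>(monthlyDirs);
    Collections.sort(sorted); // yyyy-MM sorts chronologically as plain strings
    int deleteCount = Math.max(0, sorted.size() - k);
    return new ArrayList<>(sorted.subList(0, deleteCount));
  }

  public static void main(String[] args) {
    long startTime = 1726161540000L; // assumed fixed start time for this job
    System.out.println(monthlyCacheDir("/cache", startTime));
    System.out.println(pathsToDelete(Arrays.asList("2024-07", "2024-09", "2024-08"), 2));
  }
}
```

With `k = 2`, the current month's directory (the one receiving the uploads) and the previous month's directory are never among the paths returned for deletion, which is why running the cleanup after caching is safe under these assumptions.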





Issue Time Tracking
-------------------

    Worklog Id:     (was: 934573)
    Time Spent: 1.5h  (was: 1h 20m)

> Cache Yarn jars in GobblinYarnAppLauncher
> -----------------------------------------
>
>                 Key: GOBBLIN-2135
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-2135
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: William Lo
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Gobblin YARN Application Launcher lacks some functionality used in 
> MRJobLauncher. One of the biggest gaps in feature parity is the absence of 
> jar caching: MRJobLauncher creates a monthly cache that is automatically 
> cleaned up by subsequent executions about 2 months later.
> YARN/MR requires uploading jars to HDFS. This step can be quite slow (~15 
> mins for a sizeable job to get all the jars), and given that many jobs 
> share the same jars, it makes sense to cache them together and only provide 
> YARN the shared path. 
> We also want to ensure that SNAPSHOT jars and other files are not uploaded to 
> a cache, since, unlike jar versions on Artifactory, they are not immutable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
