[ 
https://issues.apache.org/jira/browse/GOBBLIN-2135?focusedWorklogId=934555&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-934555
 ]

ASF GitHub Bot logged work on GOBBLIN-2135:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Sep/24 15:30
            Start Date: 12/Sep/24 15:30
    Worklog Time Spent: 10m 
      Work Description: phet commented on code in PR #4030:
URL: https://github.com/apache/gobblin/pull/4030#discussion_r1757111038


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnHelixUtils.java:
##########
@@ -203,16 +203,36 @@ public static void setYarnClassPath(Config config, 
Configuration yarnConfigurati
     }
   }
 
-  public static Path getJarPathCacheAndCleanIfNeeded(Config config, FileSystem 
fs) throws IOException {
+  /**
+   * Calculate the path of a jar cache on HDFS, which is retained on a monthly 
basis.
+   * @param config
+   * @return
+   * @throws IOException
+   */
+  public static Path calculateJarCachePath(Config config) throws IOException {

Review Comment:
   not major, but re-reading I'd almost name this 
`calculatePerMonthJarCachePath`



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnHelixUtils.java:
##########
@@ -203,16 +203,36 @@ public static void setYarnClassPath(Config config, 
Configuration yarnConfigurati
     }
   }
 
-  public static Path getJarPathCacheAndCleanIfNeeded(Config config, FileSystem 
fs) throws IOException {
+  /**
+   * Calculate the path of a jar cache on HDFS, which is retained on a monthly 
basis.
+   * @param config
+   * @return
+   * @throws IOException
+   */
+  public static Path calculateJarCachePath(Config config) throws IOException {
     Path jarsCacheDirMonthly = new 
Path(config.getString(GobblinYarnConfigurationKeys.JAR_CACHE_DIR));
     String monthSuffix = new 
SimpleDateFormat("yyyy-MM").format(config.getLong(GobblinYarnConfigurationKeys.YARN_APPLICATION_LAUNCHER_START_TIME_KEY));
+    return new Path(jarsCacheDirMonthly, monthSuffix);
+
+  }
+
+  /**
+   * Retain the latest k jar cache paths that are children of the parent cache 
path.
+   * @param parentCachePath
+   * @param k the number of latest jar cache paths to retain
+   * @param fs
+   * @return
+   * @throws IllegalAccessException
+   * @throws IOException
+   */
+  public static boolean retainKLatestJarCachePaths(Path parentCachePath, int 
k, FileSystem fs) throws IOException {
     // Cleanup old cache if necessary
-    List<FileStatus> jarDirs = Arrays.stream(fs.exists(jarsCacheDirMonthly)
-        ? fs.listStatus(jarsCacheDirMonthly) : new 
FileStatus[0]).sorted().collect(Collectors.toList());
-    if (jarDirs.size() > 2) {
-      fs.delete(jarDirs.get(0).getPath(), true);
+    List<FileStatus> jarDirs =
+        Arrays.stream(fs.exists(parentCachePath) ? 
fs.listStatus(parentCachePath) : new 
FileStatus[0]).sorted().collect(Collectors.toList());
+    if (jarDirs.size() > k) {
+      return fs.delete(jarDirs.get(0).getPath(), true);

Review Comment:
   since this is not a loop, it seems it would delete at most one dir even if 
there are more than 1 more than k.  is that ok?  if so, document in javadoc



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java:
##########
@@ -590,6 +590,15 @@ ApplicationId setupAndSubmitApplication() throws 
IOException, YarnException, Int
     
amContainerLaunchContext.setEnvironment(YarnHelixUtils.getEnvironmentVariables(this.yarnConfiguration));
     
amContainerLaunchContext.setCommands(Lists.newArrayList(buildApplicationMasterCommand(applicationId.toString(),
 resource.getMemory())));
 
+    if (this.jarCacheEnabled) {
+      Path jarCachePath = YarnHelixUtils.calculateJarCachePath(this.config);
+      // Retain at least the current and last month's jars to handle 
executions running for ~30 days max
+      boolean cleanedSuccessfully = 
YarnHelixUtils.retainKLatestJarCachePaths(jarCachePath.getParent(), 2, this.fs);

Review Comment:
   does this run before or after caching jars?  e.g. do we save only two prior 
AND THEN potentially add one more or we've added any new one already prior to 
retention paring it down to two?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 934555)
    Time Spent: 1h 20m  (was: 1h 10m)

> Cache Yarn jars in GobblinYarnAppLauncher
> -----------------------------------------
>
>                 Key: GOBBLIN-2135
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-2135
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: William Lo
>            Priority: Major
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Gobblin YARN Application Launcher lacks some functionality used in 
> MRJobLauncher. One of the biggest gaps in feature parity is the absence of 
> jar caching, where MRJobLauncher creates a monthly cache that is 
> automatically cleaned up by subsequent executions performed 2 months in 
> advance.
> YARN/MR requires uploading jars to HDFS, this step can be quite slow (~15 
> mins for a sizeable job to get all the jars), and given that many jobs do 
> share the same jars, it makes sense to cache them together and only provide 
> YARN the shared path. 
> We also want to ensure that SNAPSHOT jars are other files are not uploaded to 
> a cache, since they are not immutable unlike jar versions on Artifactory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to