[ 
https://issues.apache.org/jira/browse/GOBBLIN-2135?focusedWorklogId=934585&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-934585
 ]

ASF GitHub Bot logged work on GOBBLIN-2135:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Sep/24 18:07
            Start Date: 12/Sep/24 18:07
    Worklog Time Spent: 10m 
      Work Description: Will-Lo commented on code in PR #4030:
URL: https://github.com/apache/gobblin/pull/4030#discussion_r1757344912


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnHelixUtils.java:
##########
@@ -203,16 +203,36 @@ public static void setYarnClassPath(Config config, 
Configuration yarnConfigurati
     }
   }
 
-  public static Path getJarPathCacheAndCleanIfNeeded(Config config, FileSystem 
fs) throws IOException {
+  /**
+   * Calculate the path of a jar cache on HDFS, which is retained on a monthly 
basis.
+   * @param config
+   * @return
+   * @throws IOException
+   */
+  public static Path calculateJarCachePath(Config config) throws IOException {
     Path jarsCacheDirMonthly = new 
Path(config.getString(GobblinYarnConfigurationKeys.JAR_CACHE_DIR));
     String monthSuffix = new 
SimpleDateFormat("yyyy-MM").format(config.getLong(GobblinYarnConfigurationKeys.YARN_APPLICATION_LAUNCHER_START_TIME_KEY));
+    return new Path(jarsCacheDirMonthly, monthSuffix);
+
+  }
+
+  /**
+   * Retain the latest k jar cache paths that are children of the parent cache 
path.
+   * @param parentCachePath
+   * @param k the number of latest jar cache paths to retain
+   * @param fs
+   * @return
+   * @throws IllegalAccessException
+   * @throws IOException
+   */
+  public static boolean retainKLatestJarCachePaths(Path parentCachePath, int 
k, FileSystem fs) throws IOException {
     // Cleanup old cache if necessary
-    List<FileStatus> jarDirs = Arrays.stream(fs.exists(jarsCacheDirMonthly)
-        ? fs.listStatus(jarsCacheDirMonthly) : new 
FileStatus[0]).sorted().collect(Collectors.toList());
-    if (jarDirs.size() > 2) {
-      fs.delete(jarDirs.get(0).getPath(), true);
+    List<FileStatus> jarDirs =
+        Arrays.stream(fs.exists(parentCachePath) ? 
fs.listStatus(parentCachePath) : new 
FileStatus[0]).sorted().collect(Collectors.toList());
+    if (jarDirs.size() > k) {
+      return fs.delete(jarDirs.get(0).getPath(), true);

Review Comment:
   Changed it to use a loop, for consistency with naming convention. 





Issue Time Tracking
-------------------

    Worklog Id:     (was: 934585)
    Time Spent: 1h 50m  (was: 1h 40m)

> Cache Yarn jars in GobblinYarnAppLauncher
> -----------------------------------------
>
>                 Key: GOBBLIN-2135
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-2135
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: William Lo
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Gobblin YARN Application Launcher lacks some functionality used in 
> MRJobLauncher. One of the biggest gaps in feature parity is the absence of 
> jar caching, where MRJobLauncher creates a monthly cache that is 
> automatically cleaned up by subsequent executions performed 2 months in 
> advance.
> YARN/MR requires uploading jars to HDFS, this step can be quite slow (~15 
> mins for a sizeable job to get all the jars), and given that many jobs do 
> share the same jars, it makes sense to cache them together and only provide 
> YARN the shared path. 
> We also want to ensure that SNAPSHOT jars are other files are not uploaded to 
> a cache, since they are not immutable unlike jar versions on Artifactory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to