phet commented on code in PR #4030: URL: https://github.com/apache/gobblin/pull/4030#discussion_r1757111038
########## gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnHelixUtils.java: ########## @@ -203,16 +203,36 @@ public static void setYarnClassPath(Config config, Configuration yarnConfigurati } } - public static Path getJarPathCacheAndCleanIfNeeded(Config config, FileSystem fs) throws IOException { + /** + * Calculate the path of a jar cache on HDFS, which is retained on a monthly basis. + * @param config + * @return + * @throws IOException + */ + public static Path calculateJarCachePath(Config config) throws IOException { Review Comment: not major, but re-reading I'd almost name this `calculatePerMonthJarCachePath` ########## gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnHelixUtils.java: ########## @@ -203,16 +203,36 @@ public static void setYarnClassPath(Config config, Configuration yarnConfigurati } } - public static Path getJarPathCacheAndCleanIfNeeded(Config config, FileSystem fs) throws IOException { + /** + * Calculate the path of a jar cache on HDFS, which is retained on a monthly basis. + * @param config + * @return + * @throws IOException + */ + public static Path calculateJarCachePath(Config config) throws IOException { Path jarsCacheDirMonthly = new Path(config.getString(GobblinYarnConfigurationKeys.JAR_CACHE_DIR)); String monthSuffix = new SimpleDateFormat("yyyy-MM").format(config.getLong(GobblinYarnConfigurationKeys.YARN_APPLICATION_LAUNCHER_START_TIME_KEY)); + return new Path(jarsCacheDirMonthly, monthSuffix); + + } + + /** + * Retain the latest k jar cache paths that are children of the parent cache path. + * @param parentCachePath + * @param k the number of latest jar cache paths to retain + * @param fs + * @return + * @throws IllegalAccessException + * @throws IOException + */ + public static boolean retainKLatestJarCachePaths(Path parentCachePath, int k, FileSystem fs) throws IOException { // Cleanup old cache if necessary - List<FileStatus> jarDirs = Arrays.stream(fs.exists(jarsCacheDirMonthly) - ? fs.listStatus(jarsCacheDirMonthly) : new FileStatus[0]).sorted().collect(Collectors.toList()); - if (jarDirs.size() > 2) { - fs.delete(jarDirs.get(0).getPath(), true); + List<FileStatus> jarDirs = + Arrays.stream(fs.exists(parentCachePath) ? fs.listStatus(parentCachePath) : new FileStatus[0]).sorted().collect(Collectors.toList()); + if (jarDirs.size() > k) { + return fs.delete(jarDirs.get(0).getPath(), true); Review Comment: since this is not a loop, it seems it would delete at most one dir even if there are more than 1 more than k. is that ok? if so, document in javadoc ########## gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java: ########## @@ -590,6 +590,15 @@ ApplicationId setupAndSubmitApplication() throws IOException, YarnException, Int amContainerLaunchContext.setEnvironment(YarnHelixUtils.getEnvironmentVariables(this.yarnConfiguration)); amContainerLaunchContext.setCommands(Lists.newArrayList(buildApplicationMasterCommand(applicationId.toString(), resource.getMemory()))); + if (this.jarCacheEnabled) { + Path jarCachePath = YarnHelixUtils.calculateJarCachePath(this.config); + // Retain at least the current and last month's jars to handle executions running for ~30 days max + boolean cleanedSuccessfully = YarnHelixUtils.retainKLatestJarCachePaths(jarCachePath.getParent(), 2, this.fs); Review Comment: does this run before or after caching jars? e.g. do we save only two prior AND THEN potentially add one more or we've added any new one already prior to retention paring it down to two? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@gobblin.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org