[ https://issues.apache.org/jira/browse/GOBBLIN-2135?focusedWorklogId=934987&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-934987 ]
ASF GitHub Bot logged work on GOBBLIN-2135: ------------------------------------------- Author: ASF GitHub Bot Created on: 17/Sep/24 00:20 Start Date: 17/Sep/24 00:20 Worklog Time Spent: 10m Work Description: phet commented on code in PR #4030: URL: https://github.com/apache/gobblin/pull/4030#discussion_r1762107332 ########## gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnHelixUtils.java: ########## @@ -229,10 +229,11 @@ public static boolean retainKLatestJarCachePaths(Path parentCachePath, int k, Fi // Cleanup old cache if necessary List<FileStatus> jarDirs = Arrays.stream(fs.exists(parentCachePath) ? fs.listStatus(parentCachePath) : new FileStatus[0]).sorted().collect(Collectors.toList()); - if (jarDirs.size() > k) { - return fs.delete(jarDirs.get(0).getPath(), true); + boolean deletesSuccessful = true; + for (int i = 0; i < jarDirs.size() - k; i++) { + deletesSuccessful = deletesSuccessful && fs.delete(jarDirs.get(i).getPath(), true); Review Comment: `&=` ########## gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java: ########## @@ -590,6 +590,15 @@ ApplicationId setupAndSubmitApplication() throws IOException, YarnException, Int amContainerLaunchContext.setEnvironment(YarnHelixUtils.getEnvironmentVariables(this.yarnConfiguration)); amContainerLaunchContext.setCommands(Lists.newArrayList(buildApplicationMasterCommand(applicationId.toString(), resource.getMemory()))); + if (this.jarCacheEnabled) { + Path jarCachePath = YarnHelixUtils.calculateJarCachePath(this.config); + // Retain at least the current and last month's jars to handle executions running for ~30 days max + boolean cleanedSuccessfully = YarnHelixUtils.retainKLatestJarCachePaths(jarCachePath.getParent(), 2, this.fs); Review Comment: would be great to add this info to a code comment here (in the caller, since it's the combination of the two being used, so less effective in either's javadoc). the crux is that retention of K=2 won't save us in cases where K=3 might exceed any FS quotas. ########## gobblin-utility/src/main/java/org/apache/gobblin/util/filesystem/HdfsJarUploadUtils.java: ########## @@ -39,15 +39,15 @@ public class HdfsJarUploadUtils { * given the {@link FileStatus} of a jarFile and the path of directory that contains jar. * Snapshot dirs should not be shared, as different jobs may be using different versions of it. * @param fs - * @param localJar + * @param localJarPath * @param unsharedJarsDir * @param jarCacheDir * @return * @throws IOException */ - public static Path calculateDestJarFile(FileSystem fs, FileStatus localJar, Path unsharedJarsDir, Path jarCacheDir) throws IOException { - Path uploadDir = localJar.getPath().getName().contains("SNAPSHOT") ? unsharedJarsDir : jarCacheDir; - Path destJarFile = new Path(fs.makeQualified(uploadDir), localJar.getPath().getName()); + public static Path calculateDestJarFilePath(FileSystem fs, String localJarPath, Path unsharedJarsDir, Path jarCacheDir) throws IOException { + Path uploadDir = localJarPath.contains("SNAPSHOT") ? unsharedJarsDir : jarCacheDir; Review Comment: seeing this invoked, `localJarPath` looks more like `localJarBasename`, not the (full) path. (or simply `jarName`) Issue Time Tracking ------------------- Worklog Id: (was: 934987) Time Spent: 2h (was: 1h 50m) > Cache Yarn jars in GobblinYarnAppLauncher > ----------------------------------------- > > Key: GOBBLIN-2135 > URL: https://issues.apache.org/jira/browse/GOBBLIN-2135 > Project: Apache Gobblin > Issue Type: Improvement > Reporter: William Lo > Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > Gobblin YARN Application Launcher lacks some functionality used in > MRJobLauncher. One of the biggest gaps in feature parity is the absence of > jar caching, where MRJobLauncher creates a monthly cache that is > automatically cleaned up by subsequent executions performed 2 months in > advance. > YARN/MR requires uploading jars to HDFS, this step can be quite slow (~15 > mins for a sizeable job to get all the jars), and given that many jobs do > share the same jars, it makes sense to cache them together and only provide > YARN the shared path. > We also want to ensure that SNAPSHOT jars are other files are not uploaded to > a cache, since they are not immutable unlike jar versions on Artifactory. -- This message was sent by Atlassian Jira (v8.20.10#820010)