[ https://issues.apache.org/jira/browse/GOBBLIN-2135?focusedWorklogId=934347&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-934347 ]
ASF GitHub Bot logged work on GOBBLIN-2135:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 11/Sep/24 20:09
            Start Date: 11/Sep/24 20:09
    Worklog Time Spent: 10m
      Work Description: Will-Lo commented on code in PR #4030:
URL: https://github.com/apache/gobblin/pull/4030#discussion_r1755520391

##########
gobblin-temporal/src/main/java/org/apache/gobblin/temporal/yarn/YarnService.java:
##########
@@ -484,12 +487,29 @@ private void requestContainer(Optional<String> preferredNode, Resource resource)
   protected ContainerLaunchContext newContainerLaunchContext(ContainerInfo containerInfo) throws IOException {
     Path appWorkDir = GobblinClusterUtils.getAppWorkDirPathFromConfig(this.config, this.fs, this.applicationName, this.applicationId);
+    Path containerJarsUnsharedDir = new Path(appWorkDir, GobblinYarnConfigurationKeys.CONTAINER_WORK_DIR_NAME);
+    Path jarCacheDir = this.jarCacheEnabled ? YarnHelixUtils.getJarPathCacheAndCleanIfNeeded(this.config, this.fs) : appWorkDir;
+    Path containerJarsCachedDir = new Path(jarCacheDir, GobblinYarnConfigurationKeys.CONTAINER_WORK_DIR_NAME);
+    LOGGER.info("Container cached jars root dir: " + containerJarsCachedDir);
+    LOGGER.info("Container uncached jars root dir: " + containerJarsUnsharedDir);
     Path containerWorkDir = new Path(appWorkDir, GobblinYarnConfigurationKeys.CONTAINER_WORK_DIR_NAME);
-    Map<String, LocalResource> resourceMap = Maps.newHashMap();
+    Map<String, LocalResource> resourceMap = Maps.newHashMap();
+    // Always fetch any jars from the appWorkDir for any potential snapshot jars
     addContainerLocalResources(new Path(appWorkDir, GobblinYarnConfigurationKeys.LIB_JARS_DIR_NAME), resourceMap);
-    addContainerLocalResources(new Path(containerWorkDir, GobblinYarnConfigurationKeys.APP_JARS_DIR_NAME), resourceMap);
+    if (this.config.hasPath(GobblinYarnConfigurationKeys.CONTAINER_FILES_LOCAL_KEY)) {
+      addContainerLocalResources(new Path(containerJarsUnsharedDir, GobblinYarnConfigurationKeys.APP_JARS_DIR_NAME),
+          resourceMap);
+    }
+    if (this.jarCacheEnabled) {
+      addContainerLocalResources(new Path(jarCacheDir, GobblinYarnConfigurationKeys.LIB_JARS_DIR_NAME), resourceMap);
+      if (this.config.hasPath(GobblinYarnConfigurationKeys.CONTAINER_FILES_LOCAL_KEY)) {

Review Comment:
   Oops, this should have used the `gobblin.yarn.container.jars` key.

Issue Time Tracking
-------------------

    Worklog Id:     (was: 934347)
    Time Spent: 40m  (was: 0.5h)

> Cache Yarn jars in GobblinYarnAppLauncher
> -----------------------------------------
>
>                 Key: GOBBLIN-2135
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-2135
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: William Lo
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Gobblin YARN Application Launcher lacks some functionality used in
> MRJobLauncher. One of the biggest gaps in feature parity is the absence of
> jar caching, where MRJobLauncher creates a monthly cache that is
> automatically cleaned up by subsequent executions performed 2 months in
> advance.
> YARN/MR requires uploading jars to HDFS; this step can be quite slow (~15
> mins for a sizeable job to get all the jars), and given that many jobs
> share the same jars, it makes sense to cache them together and only provide
> YARN the shared path.
> We also want to ensure that SNAPSHOT jars and other files are not uploaded to
> a cache, since they are not immutable, unlike jar versions on Artifactory.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
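
The caching policy described in the issue (a shared monthly cache for immutable jars, with SNAPSHOT jars and other files kept out of it) can be illustrated with a minimal sketch. This is not the code in the PR: the class `JarCacheSketch`, its method names, the `yyyy-MM` directory layout, and the reading of the cleanup rule as "a run may delete the directory from two months earlier" are all assumptions made for illustration.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; none of these names come from Gobblin.
final class JarCacheSketch {
  // Assumed directory naming: one folder per calendar month, e.g. "2024-09".
  private static final DateTimeFormatter MONTH_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM");

  // Assumed rule: only non-SNAPSHOT jar files are immutable and therefore cacheable.
  static boolean isCacheable(String fileName) {
    return fileName.endsWith(".jar") && !fileName.contains("SNAPSHOT");
  }

  // Shared cache directory for the given month under a common root.
  static String monthlyCacheDir(String cacheRoot, YearMonth month) {
    return cacheRoot + "/" + month.format(MONTH_FORMAT);
  }

  // Cacheable jars go to the shared monthly cache; everything else stays
  // in the per-application work directory.
  static String uploadDir(String fileName, String appWorkDir, String cacheRoot, YearMonth month) {
    return isCacheable(fileName) ? monthlyCacheDir(cacheRoot, month) : appWorkDir;
  }

  // Assumed cleanup rule: a run in `current` may delete the cache from two months earlier.
  static String expiredCacheDir(String cacheRoot, YearMonth current) {
    return monthlyCacheDir(cacheRoot, current.minusMonths(2));
  }

  public static void main(String[] args) {
    YearMonth now = YearMonth.of(2024, 9);
    System.out.println(uploadDir("guava-32.1.2-jre.jar", "/app/work", "/jar-cache", now));
    System.out.println(uploadDir("gobblin-core-0.18.0-SNAPSHOT.jar", "/app/work", "/jar-cache", now));
    System.out.println(expiredCacheDir("/jar-cache", now));
  }
}
```

Keeping the cacheability check in one place means the upload side and the container side can agree on which directory holds a given jar, which is the concern the diff addresses by reading from both the application work directory and the cache directory.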