[ https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411534&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411534 ]
ASF GitHub Bot logged work on GOBBLIN-1099:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Mar/20 00:11
            Start Date: 28/Mar/20 00:11
    Worklog Time Spent: 10m
      Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399590340

 ##########
 File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java
 ##########
 @@ -479,6 +488,26 @@ void connectHelixManager() {
      }
    }

 +  /**
 +   * A method to disable pre-existing live instances in a Helix cluster. This can happen when a previous Yarn application
 +   * leaves behind orphaned Yarn worker processes. Since Helix does not provide an API to drop a live instance, we use
 +   * the disable instance API to fence off these orphaned instances and prevent them from becoming participants in the
 +   * new cluster.
 +   *
 +   * NOTE: this is a workaround for an existing YARN bug. Once YARN has a fix to guarantee container kills on application
 +   * completion, this method should be removed.
 +   */
 +  void disableLiveHelixInstances() {
 +    String clusterName = this.helixManager.getClusterName();
 +    HelixAdmin helixAdmin = this.helixManager.getClusterManagmentTool();
 +    List<String> liveInstances = HelixUtils.getLiveInstances(this.helixManager);
 +    LOGGER.warn("Found {} live instances in the cluster.", liveInstances.size());
 +    for (String instanceName: liveInstances) {
 +      LOGGER.warn("Disabling instance: {}", instanceName);
 +      helixAdmin.enableInstance(clusterName, instanceName, false);

 Review comment:
 No, there is none. The instance is part of the cluster, but disabled. It needs to be re-enabled to start getting task assignments from Helix.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and
use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 411534)
    Time Spent: 1h 40m  (was: 1.5h)

> Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
> -----------------------------------------------------------
>
>                 Key: GOBBLIN-1099
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1099
>             Project: Apache Gobblin
>          Issue Type: Improvement
>          Components: gobblin-yarn
>    Affects Versions: 0.15.0
>            Reporter: Sudarshan Vasudevan
>            Assignee: Abhishek Tiwari
>            Priority: Major
>             Fix For: 0.15.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> A Yarn application may leave behind orphaned containers, which can happen due
> to lost node managers. The orphaned containers however can continue to run
> (potentially forever) as participants in the Helix cluster. This can cause
> the following problems for a Gobblin-on-Yarn application:
> # Double publish of data and commit of state
> # Task failures and partition starvation during application restarts, as
> Helix may assign tasks to the orphaned containers which have a stale state
> and configuration
> # Container failures on application restarts due to Helix instance name
> collisions with orphaned containers
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
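
For readers following the review thread, here is a rough sketch of the fencing behaviour discussed in the comment above: instances found live at launch are disabled, they remain members of the cluster, and they receive no assignments until someone re-enables them. This is not the Gobblin or Helix implementation. OrphanFencingSketch, InstanceAdmin, InMemoryCluster and the instance names are all hypothetical, and no Helix classes are used; only the shape of enableInstance(clusterName, instanceName, enabled) mirrors the call shown in the diff.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of fencing orphaned instances; all names here are hypothetical. */
final class OrphanFencingSketch {

  /** Stand-in for the one admin call used in the patch (same parameter order). */
  interface InstanceAdmin {
    void enableInstance(String clusterName, String instanceName, boolean enabled);
  }

  /** In-memory cluster that tracks each known instance and its enabled flag. */
  static final class InMemoryCluster implements InstanceAdmin {
    private final String name;
    private final Map<String, Boolean> enabledByInstance = new LinkedHashMap<>();

    InMemoryCluster(String name) {
      this.name = name;
    }

    /** A new instance joins enabled; a known instance keeps its current flag. */
    void join(String instanceName) {
      enabledByInstance.putIfAbsent(instanceName, Boolean.TRUE);
    }

    List<String> liveInstances() {
      return new ArrayList<>(enabledByInstance.keySet());
    }

    @Override
    public void enableInstance(String clusterName, String instanceName, boolean enabled) {
      if (!name.equals(clusterName)) {
        throw new IllegalArgumentException("Unknown cluster: " + clusterName);
      }
      if (!enabledByInstance.containsKey(instanceName)) {
        throw new IllegalArgumentException("Unknown instance: " + instanceName);
      }
      enabledByInstance.put(instanceName, enabled);
    }

    /** Only enabled instances are eligible for task assignment in this model. */
    List<String> assignableInstances() {
      List<String> result = new ArrayList<>();
      for (Map.Entry<String, Boolean> entry : enabledByInstance.entrySet()) {
        if (entry.getValue()) {
          result.add(entry.getKey());
        }
      }
      return result;
    }
  }

  /** Disables every instance that is live right now and returns their names. */
  static List<String> disableLiveInstances(InMemoryCluster cluster, String clusterName) {
    List<String> fenced = cluster.liveInstances();
    for (String instanceName : fenced) {
      cluster.enableInstance(clusterName, instanceName, false);
    }
    return fenced;
  }

  public static void main(String[] args) {
    InMemoryCluster cluster = new InMemoryCluster("demoCluster");
    cluster.join("orphan_1"); // left behind by a previous application

    System.out.println("Fenced: " + disableLiveInstances(cluster, "demoCluster"));

    cluster.join("worker_1"); // container of the new application
    System.out.println("Assignable: " + cluster.assignableInstances());

    // An explicit re-enable is required before the fenced instance gets work again.
    cluster.enableInstance("demoCluster", "orphan_1", true);
    System.out.println("Assignable after re-enable: " + cluster.assignableInstances());
  }
}
```

In this model a fenced instance that tries to join again stays disabled, which matches the point made in the review comment: the instance is still part of the cluster and only an explicit re-enable makes it eligible for tasks.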