autumnust commented on a change in pull request #2940: GOBBLIN-1099: Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399584405
 
 

 ##########
 File path: gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinYarnAppLauncher.java
 ##########
 @@ -479,6 +488,26 @@ void connectHelixManager() {
     }
   }
 
+  /**
+   * A method to disable pre-existing live instances in a Helix cluster. This can happen when a previous Yarn application
+   * leaves behind orphaned Yarn worker processes. Since Helix does not provide an API to drop a live instance, we use
+   * the disable instance API to fence off these orphaned instances and prevent them from becoming participants in the
+   * new cluster.
+   *
+   * NOTE: this is a workaround for an existing YARN bug. Once YARN has a fix to guarantee container kills on application
+   * completion, this method should be removed.
+   */
+  void disableLiveHelixInstances() {
+    String clusterName = this.helixManager.getClusterName();
+    HelixAdmin helixAdmin = this.helixManager.getClusterManagmentTool();
+    List<String> liveInstances = HelixUtils.getLiveInstances(this.helixManager);
+    LOGGER.warn("Found {} live instances in the cluster.", liveInstances.size());
+    for (String instanceName: liveInstances) {
+      LOGGER.warn("Disabling instance: {}", instanceName);
+      helixAdmin.enableInstance(clusterName, instanceName, false);
 
 Review comment:
   Is there any mechanism on the Helix side for an instance to retry joining the 
cluster after being disabled? Just want to make sure
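
For reference, the fencing step in the hunk above can be sketched in isolation. This is a minimal sketch under stated assumptions, not the PR's code: `InstanceAdmin` and `InMemoryAdmin` are hypothetical stand-ins that mirror only the `enableInstance(clusterName, instanceName, false)` call shown in the diff, and the cluster and instance names are made up. It does not model whether a disabled instance can later rejoin, which is the open question in the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FencingSketch {

  /** Hypothetical stand-in for the single admin call used in the diff. */
  interface InstanceAdmin {
    void enableInstance(String clusterName, String instanceName, boolean enabled);
  }

  /** Hypothetical in-memory admin that records each instance's enabled flag. */
  static class InMemoryAdmin implements InstanceAdmin {
    final Map<String, Boolean> enabled = new HashMap<>();

    @Override
    public void enableInstance(String clusterName, String instanceName, boolean isEnabled) {
      enabled.put(clusterName + "/" + instanceName, isEnabled);
    }
  }

  /** Disables every instance in liveInstances and returns the names that were fenced off. */
  static List<String> disableLiveInstances(
      InstanceAdmin admin, String clusterName, List<String> liveInstances) {
    List<String> disabled = new ArrayList<>();
    for (String instanceName : liveInstances) {
      // Same shape as the call in the diff: enabled=false fences the instance off.
      admin.enableInstance(clusterName, instanceName, false);
      disabled.add(instanceName);
    }
    return disabled;
  }

  public static void main(String[] args) {
    InMemoryAdmin admin = new InMemoryAdmin();
    List<String> fenced =
        disableLiveInstances(admin, "demoCluster", Arrays.asList("worker_1", "worker_2"));
    System.out.println("Disabled: " + fenced);
  }
}
```

Keeping the admin behind a small interface makes the loop testable without a live Helix cluster; in the PR itself the call goes to the `HelixAdmin` obtained from the Helix manager.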

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
