[ 
https://issues.apache.org/jira/browse/GOBBLIN-1099?focusedWorklogId=411565&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-411565
 ]

ASF GitHub Bot logged work on GOBBLIN-1099:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Mar/20 00:34
            Start Date: 28/Mar/20 00:34
    Worklog Time Spent: 10m 
      Work Description: sv2000 commented on pull request #2940: GOBBLIN-1099: 
Handle orphaned Yarn containers in Gobblin-on-Yarn clus…
URL: https://github.com/apache/incubator-gobblin/pull/2940#discussion_r399594383
 
 

 ##########
 File path: gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinTaskRunner.java
 ##########
 @@ -366,20 +377,74 @@ boolean isStopped() {
   }
 
   @VisibleForTesting
-  void connectHelixManager() {
-    try {
-      this.jobHelixManager.connect();
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
-              new ParticipantShutdownMessageHandlerFactory());
-      this.jobHelixManager.getMessagingService()
-          .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
-              getUserDefinedMessageHandlerFactory());
-      if (this.taskDriverHelixManager.isPresent()) {
-        this.taskDriverHelixManager.get().connect();
+  void connectHelixManager() throws Exception {
+    this.jobHelixManager.connect();
+    //Ensure the instance is enabled.
+    this.jobHelixManager.getClusterManagmentTool().enableInstance(clusterName, helixInstanceName, true);
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(GobblinHelixConstants.SHUTDOWN_MESSAGE_TYPE,
+            new ParticipantShutdownMessageHandlerFactory());
+    this.jobHelixManager.getMessagingService()
+        .registerMessageHandlerFactory(Message.MessageType.USER_DEFINE_MSG.toString(),
+            getUserDefinedMessageHandlerFactory());
+    if (this.taskDriverHelixManager.isPresent()) {
+      this.taskDriverHelixManager.get().connect();
+      //Ensure the instance is enabled.
+      this.taskDriverHelixManager.get().getClusterManagmentTool().enableInstance(this.taskDriverHelixManager.get().getClusterName(), helixInstanceName, true);
+    }
+  }
+
+  /**
+   * A method to handle failures joining Helix cluster. The method will perform the following steps before attempting
+   * to re-join the cluster:
+   * <li>
+   *   <ul>Disconnect from Helix cluster, which would close any open clients</ul>
+   *   <ul>Drop instance from Helix cluster, to remove any partial instance structure from Helix</ul>
+   *   <ul>Re-construct helix manager instances, used to re-join the cluster</ul>
+   * </li>
+   */
+  private void onClusterJoinFailure() {
+    logger.warn("Disconnecting Helix manager..");
+    disconnectHelixManager();
+
+    HelixAdmin admin = new ZKHelixAdmin(clusterConfig.getString(GobblinClusterConfigurationKeys.ZK_CONNECTION_STRING_KEY));
+    //Drop the helix Instance
+    logger.warn("Dropping instance: {} from cluster: {}", helixInstanceName, clusterName);
+    HelixUtils.dropInstanceIfExists(admin, clusterName, helixInstanceName);
 
 Review comment:
   Changed it to return void. The dropInstanceIfExists method is intended to 
swallow the HelixException, which can only occur when the instance path is not 
present, i.e., the instance does not exist.
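
   For illustration only, the drop-and-swallow behaviour described in this 
comment could look roughly like the sketch below. This is not the code in the 
pull request: the `Admin` interface, `InMemoryAdmin`, and the local 
`HelixException` class are hypothetical stand-ins so that the sketch compiles 
without the Helix library, and the simplified `dropInstance` signature is an 
assumption rather than the real Helix admin API.

```java
import java.util.HashSet;
import java.util.Set;

class DropInstanceSketch {
  /** Hypothetical stand-in for Helix's exception type, so the sketch has no Helix dependency. */
  static class HelixException extends RuntimeException {
    HelixException(String message) {
      super(message);
    }
  }

  /** Hypothetical, simplified admin interface; not the real Helix admin API. */
  interface Admin {
    void dropInstance(String clusterName, String instanceName);
  }

  /** In-memory admin that throws when the instance is absent, mimicking a missing instance path. */
  static class InMemoryAdmin implements Admin {
    final Set<String> instances = new HashSet<>();

    @Override
    public void dropInstance(String clusterName, String instanceName) {
      if (!instances.remove(clusterName + "/" + instanceName)) {
        throw new HelixException("Instance path does not exist: " + instanceName);
      }
    }
  }

  /** Drops the instance if it is present; a missing instance is logged and otherwise ignored. */
  static void dropInstanceIfExists(Admin admin, String clusterName, String instanceName) {
    try {
      admin.dropInstance(clusterName, instanceName);
    } catch (HelixException e) {
      // Assumed to mean "instance does not exist", so there is nothing to drop.
      System.err.println("Could not drop instance " + instanceName + ": " + e.getMessage());
    }
  }

  public static void main(String[] args) {
    InMemoryAdmin admin = new InMemoryAdmin();
    admin.instances.add("cluster/instance_1");
    dropInstanceIfExists(admin, "cluster", "instance_1"); // removes the instance
    dropInstanceIfExists(admin, "cluster", "instance_1"); // already gone; exception is swallowed
    System.out.println("Remaining instances: " + admin.instances.size());
  }
}
```

   Because the method returns void and swallows the exception, a caller such 
as onClusterJoinFailure() can invoke it unconditionally before re-joining the 
cluster.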
 


Issue Time Tracking
-------------------

    Worklog Id:     (was: 411565)
    Time Spent: 2h 20m  (was: 2h 10m)

> Handle orphaned Yarn containers in Gobblin-on-Yarn clusters
> -----------------------------------------------------------
>
>                 Key: GOBBLIN-1099
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1099
>             Project: Apache Gobblin
>          Issue Type: Improvement
>          Components: gobblin-yarn
>    Affects Versions: 0.15.0
>            Reporter: Sudarshan Vasudevan
>            Assignee: Abhishek Tiwari
>            Priority: Major
>             Fix For: 0.15.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> A Yarn application may leave behind orphaned containers, which can happen 
> when node managers are lost. These orphaned containers can nevertheless 
> continue to run (potentially forever) as participants in the Helix cluster. 
> This can cause the following problems for a Gobblin-on-Yarn application:
>  # Double publishing of data and double commit of state
>  # Task failures and partition starvation during application restarts, as 
> Helix may assign tasks to the orphaned containers, which have stale state 
> and configuration
>  # Container failures on application restarts due to Helix instance name 
> collisions with orphaned containers
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
