[ 
https://issues.apache.org/jira/browse/GOBBLIN-1704?focusedWorklogId=809678&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-809678
 ]

ASF GitHub Bot logged work on GOBBLIN-1704:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Sep/22 22:17
            Start Date: 16/Sep/22 22:17
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r973475222


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
     LOGGER.info("ApplicationMaster registration response: " + response);
     this.maxResourceCapacity = Optional.of(response.getMaximumResourceCapability());
 
+    // All previous helix instances should be purged on startup. Gobblin task runners are stateless from helix
+    // perspective because all important state is persisted separately in Workunit State Store or Watermark store.
+    // Offline duration of 0 means any offline instance should be purged (Note: there aren't any online instances
+    // when this code runs, this is during startup before any containers are allocated).
+    LOGGER.info("Purging offline helix instances before allocating containers for helixClusterName={}, connectionString={}", helixManager.getClusterName(), helixManager.getMetadataStoreConnectionString());
+    long offlineDuration = 0;
+    this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(), offlineDuration);

Review Comment:
   +1 to making this behavior configurable, so that if something goes wrong on
the Helix side and causes starvation, we can quickly mitigate by disabling this
behavior. It would also be good to add instrumentation here to measure how long
it usually takes to purge the offline instances.
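
A minimal sketch of what this suggestion could look like, not the code in the PR. The config key, the `HelixPurgeSketch` class, and the `OfflineInstancePurger` interface are hypothetical names introduced for illustration; the interface only stands in for the `helixAdmin.purgeOfflineInstances(clusterName, offlineDuration)` call from the diff so that the example compiles without Helix on the classpath. The purge is assumed to stay enabled by default, and the elapsed time is returned so the caller can log it or emit it as a metric.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

class HelixPurgeSketch {
  // Hypothetical config key; not an existing Gobblin property.
  static final String PURGE_ENABLED_KEY = "gobblin.yarn.helix.purgeOfflineInstances.enabled";

  /** Stand-in for the HelixAdmin call in the diff, so the sketch has no Helix dependency. */
  interface OfflineInstancePurger {
    void purgeOfflineInstances(String clusterName, long offlineDuration);
  }

  /**
   * Purges offline instances unless the flag is set to false.
   * Returns the elapsed time in milliseconds, or -1 if the purge was skipped.
   * The clock is injected (nanosecond readings) so the timing can be tested.
   */
  static long purgeIfEnabled(Properties props, OfflineInstancePurger purger,
      String clusterName, LongSupplier nanoClock) {
    // Assumption: purging stays on by default and can be switched off as a mitigation.
    boolean enabled = Boolean.parseBoolean(props.getProperty(PURGE_ENABLED_KEY, "true"));
    if (!enabled) {
      return -1L;
    }
    long start = nanoClock.getAsLong();
    // Offline duration of 0: purge every offline instance, as in the diff.
    purger.purgeOfflineInstances(clusterName, 0L);
    return TimeUnit.NANOSECONDS.toMillis(nanoClock.getAsLong() - start);
  }
}
```

In `startUp()`, a call such as `purgeIfEnabled(props, helixAdmin::purgeOfflineInstances, helixManager.getClusterName(), System::nanoTime)` would be one way to wire this in, assuming `props` holds the job configuration; the returned duration could then be logged or reported through whatever metrics mechanism the service already uses.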





Issue Time Tracking
-------------------

    Worklog Id:     (was: 809678)
    Time Spent: 1h 50m  (was: 1h 40m)

> Purge offline helix instances during startup
> --------------------------------------------
>
>                 Key: GOBBLIN-1704
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1704
>             Project: Apache Gobblin
>          Issue Type: New Feature
>            Reporter: Matthew Ho
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
