[ 
https://issues.apache.org/jira/browse/GOBBLIN-1704?focusedWorklogId=809250&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-809250
 ]

ASF GitHub Bot logged work on GOBBLIN-1704:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Sep/22 18:51
            Start Date: 15/Sep/22 18:51
    Worklog Time Spent: 10m 
      Work Description: homatthew commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r972323434


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
     LOGGER.info("ApplicationMaster registration response: " + response);
     this.maxResourceCapacity = Optional.of(response.getMaximumResourceCapability());
 
+    // All previous helix instances should be purged on startup. Gobblin task runners are stateless from helix
+    // perspective because all important state is persisted separately in Workunit State Store or Watermark store.
+    // Offline duration of 0 means any offline instance should be purged (Note: there aren't any online instances
+    // when this code runs, this is during startup before any containers are allocated).
+    LOGGER.info("Purging offline helix instances before allocating containers for helixClusterName={}, connectionString={}", helixManager.getClusterName(), helixManager.getMetadataStoreConnectionString());
+    long offlineDuration = 0;
+    this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(), offlineDuration);

Review Comment:
   Well, kind of. As mentioned:
   
   > But adding a timeout adds a potential bug / risk where we timeout and then helix starts purging instances while we are allocating new instances (bad bug with nondeterministic behavior)
   
   With a timeout, we put ourselves at risk of having a live instance without an INSTANCE config. That would essentially bork the live instance and have a domino effect. That said, I do think we should send a GTE if this call is taking too long.
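
One way to get an alert for a slow purge without adding a timeout is a watchdog that fires while the call is still blocking and never interrupts it. The sketch below is only an illustration under assumptions: `PurgeWatchdog`, `runWithSlowCallAlert`, `warnAfterMillis`, and `onSlow` are hypothetical names that do not appear in the PR, and the callback stands in for whatever would actually emit the GTE.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Gobblin or Helix.
final class PurgeWatchdog {
  private PurgeWatchdog() {}

  /**
   * Runs purgeCall to completion on the calling thread. If it is still running
   * after warnAfterMillis, onSlow is invoked once on a separate thread.
   * The purge itself is never cancelled or interrupted.
   */
  static void runWithSlowCallAlert(Runnable purgeCall, long warnAfterMillis, Runnable onSlow) {
    ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
    ScheduledFuture<?> alert = watchdog.schedule(onSlow, warnAfterMillis, TimeUnit.MILLISECONDS);
    try {
      // Blocks until the purge finishes, however long that takes.
      purgeCall.run();
    } finally {
      // Drop the alert if it has not fired yet, then release the watchdog thread.
      alert.cancel(false);
      watchdog.shutdownNow();
    }
  }
}
```

In `startUp()` the purge call from the diff would be passed as `purgeCall` (for example, a lambda wrapping `helixAdmin.purgeOfflineInstances(...)`), so container allocation still waits for the purge to finish and only the alerting happens asynchronously.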





Issue Time Tracking
-------------------

    Worklog Id:     (was: 809250)
    Time Spent: 1h  (was: 50m)

> Purge offline helix instances during startup
> --------------------------------------------
>
>                 Key: GOBBLIN-1704
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1704
>             Project: Apache Gobblin
>          Issue Type: New Feature
>            Reporter: Matthew Ho
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
