[jira] [Work logged] (GOBBLIN-1704) Purge offline helix instances during startup

ASF GitHub Bot (Jira) Thu, 15 Sep 2022 15:50:13 -0700


     [ 
https://issues.apache.org/jira/browse/GOBBLIN-1704?focusedWorklogId=809335&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-809335
 ]


ASF GitHub Bot logged work on GOBBLIN-1704:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 15/Sep/22 22:49
            Start Date: 15/Sep/22 22:49
    Worklog Time Spent: 10m 
      Work Description: hanghangliu commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r972476792


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
     LOGGER.info("ApplicationMaster registration response: " + response);
     this.maxResourceCapacity = Optional.of(response.getMaximumResourceCapability());
 
+    // All previous helix instances should be purged on startup. Gobblin task runners are stateless from helix
+    // perspective because all important state is persisted separately in Workunit State Store or Watermark store.
+    // Offline duration of 0 means any offline instance should be purged (Note: there aren't any online instances
+    // when this code runs, this is during startup before any containers are allocated).
+    LOGGER.info("Purging offline helix instances before allocating containers for helixClusterName={}, connectionString={}", helixManager.getClusterName(), helixManager.getMetadataStoreConnectionString());
+    long offlineDuration = 0;
+    this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(), offlineDuration);

Review Comment:
   We should also add a config here to make this action configurable.
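
   One way the suggestion above could look is sketched below. This is only an illustration, not code from the PR: the two config keys, the `OfflineInstancePurger` interface (a stand-in for the single `HelixAdmin.purgeOfflineInstances(clusterName, offlineDuration)` call shown in the diff), and the use of `java.util.Properties` instead of Gobblin's own config handling are all assumptions. The defaults are chosen so that an unconfigured job behaves like the diff: purge enabled, offline duration 0.

```java
import java.util.Properties;

public class ConfigurablePurgeSketch {
  // Hypothetical config keys, invented for this sketch; they do not come from the PR.
  static final String PURGE_ENABLED_KEY = "gobblin.yarn.helix.purgeOfflineInstances.enabled";
  static final String PURGE_OFFLINE_DURATION_MS_KEY =
      "gobblin.yarn.helix.purgeOfflineInstances.offlineDurationMs";

  /** Stand-in for the one HelixAdmin call used in the diff, so the sketch runs without Helix. */
  interface OfflineInstancePurger {
    void purgeOfflineInstances(String clusterName, long offlineDuration);
  }

  /**
   * Invokes the purge only when the (hypothetical) flag is enabled.
   * Returns true if the purge was invoked, false if it was skipped.
   */
  static boolean purgeIfEnabled(Properties props, String clusterName, OfflineInstancePurger purger) {
    // Default "true" keeps the behavior of the diff when nothing is configured.
    boolean enabled = Boolean.parseBoolean(props.getProperty(PURGE_ENABLED_KEY, "true"));
    if (!enabled) {
      return false;
    }
    // Default "0" matches the diff: any offline instance is purged.
    long offlineDuration = Long.parseLong(props.getProperty(PURGE_OFFLINE_DURATION_MS_KEY, "0"));
    purger.purgeOfflineInstances(clusterName, offlineDuration);
    return true;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(PURGE_OFFLINE_DURATION_MS_KEY, "0");
    boolean purged = purgeIfEnabled(props, "exampleCluster",
        (cluster, duration) ->
            System.out.println("purge " + cluster + " offlineDuration=" + duration));
    System.out.println("purge invoked: " + purged);
  }
}
```

   In `YarnService.startUp()` the lambda would presumably be replaced by `this.helixAdmin::purgeOfflineInstances`; that wiring is not shown here.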



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
     LOGGER.info("ApplicationMaster registration response: " + response);
     this.maxResourceCapacity = Optional.of(response.getMaximumResourceCapability());
 
+    // All previous helix instances should be purged on startup. Gobblin task runners are stateless from helix
+    // perspective because all important state is persisted separately in Workunit State Store or Watermark store.
+    // Offline duration of 0 means any offline instance should be purged (Note: there aren't any online instances
+    // when this code runs, this is during startup before any containers are allocated).
+    LOGGER.info("Purging offline helix instances before allocating containers for helixClusterName={}, connectionString={}", helixManager.getClusterName(), helixManager.getMetadataStoreConnectionString());
+    long offlineDuration = 0;
+    this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(), offlineDuration);

Review Comment:
   Agreed that we need to avoid the race condition, according to
   [ZKHelixAdmin.purgeOfflineInstances()](https://github.com/apache/helix/blob/ee61ff434ee37d4ab8456c739b2229f99d9e0e72/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java#L270).

   Please also add a GTE (GobblinTrackingEvent) or an alert here in case this takes too long.
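
   A rough sketch of the timing check requested above follows. It is an assumption-laden illustration rather than the PR's implementation: `runWithAlert`, the 30-second threshold, and the alert hook are hypothetical, and the hook stands in for whatever would actually emit a tracking event or page someone. The clock is injected only so the logic can be exercised without sleeping.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

public class TimedPurgeSketch {
  /**
   * Runs the purge, measures how long it took, and calls the alert hook
   * when the elapsed time exceeds the threshold. Returns the elapsed millis.
   * The hook also fires if the purge throws after running too long.
   */
  static long runWithAlert(Runnable purge, long alertThresholdMillis,
      LongSupplier nanoClock, LongConsumer slowPurgeAlert) {
    long start = nanoClock.getAsLong();
    long elapsedMillis;
    try {
      purge.run();
    } finally {
      elapsedMillis = TimeUnit.NANOSECONDS.toMillis(nanoClock.getAsLong() - start);
      if (elapsedMillis > alertThresholdMillis) {
        slowPurgeAlert.accept(elapsedMillis);
      }
    }
    return elapsedMillis;
  }

  public static void main(String[] args) {
    long elapsed = runWithAlert(
        () -> { /* stand-in for the purgeOfflineInstances call from the diff */ },
        TimeUnit.SECONDS.toMillis(30),  // hypothetical threshold
        System::nanoTime,
        ms -> System.err.println("Helix purge took " + ms + " ms"));
    System.out.println("purge finished in " + elapsed + " ms");
  }
}
```

   Note that this only reports a slow purge after the call returns. Detecting a purge that never returns would need a separate timeout or watchdog, which is outside this sketch.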





Issue Time Tracking
-------------------

    Worklog Id:     (was: 809335)
    Time Spent: 1.5h  (was: 1h 20m)

> Purge offline helix instances during startup
> --------------------------------------------
>
>                 Key: GOBBLIN-1704
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1704
>             Project: Apache Gobblin
>          Issue Type: New Feature
>            Reporter: Matthew Ho
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
