hanghangliu commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r972476792


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
     LOGGER.info("ApplicationMaster registration response: " + response);
     this.maxResourceCapacity = 
Optional.of(response.getMaximumResourceCapability());
 
+    // All previous helix instances should be purged on startup. Gobblin task 
runners are stateless from helix
+    // perspective because all important state is persisted separately in 
Workunit State Store or Watermark store.
+    // Offline duration of 0 means any offline instance should be purged 
(Note: there aren't any online instances
+    // when this code runs, this is during startup before any containers are 
allocated).
+    LOGGER.info("Purging offline helix instances before allocating containers 
for helixClusterName={}, connectionString={}", helixManager.getClusterName(), 
helixManager.getMetadataStoreConnectionString());
+    long offlineDuration = 0;
+    this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(), 
offlineDuration);

Review Comment:
   We should also add a config here to make this action configurable 



##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
     LOGGER.info("ApplicationMaster registration response: " + response);
     this.maxResourceCapacity = 
Optional.of(response.getMaximumResourceCapability());
 
+    // All previous helix instances should be purged on startup. Gobblin task 
runners are stateless from helix
+    // perspective because all important state is persisted separately in 
Workunit State Store or Watermark store.
+    // Offline duration of 0 means any offline instance should be purged 
(Note: there aren't any online instances
+    // when this code runs, this is during startup before any containers are 
allocated).
+    LOGGER.info("Purging offline helix instances before allocating containers 
for helixClusterName={}, connectionString={}", helixManager.getClusterName(), 
helixManager.getMetadataStoreConnectionString());
+    long offlineDuration = 0;
+    this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(), 
offlineDuration);

Review Comment:
   Agreed that we need to avoid the race condition, according to 
[ZKHelixAdmin.purgeOfflineInstances()](https://github.com/apache/helix/blob/ee61ff434ee37d4ab8456c739b2229f99d9e0e72/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java#L270).
 
   Please do add a GTE or an alert here if this takes too long



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to