ZihanLi58 commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r973475222

##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
     LOGGER.info("ApplicationMaster registration response: " + response);
     this.maxResourceCapacity = Optional.of(response.getMaximumResourceCapability());
+    // All previous helix instances should be purged on startup. Gobblin task runners are stateless from helix
+    // perspective because all important state is persisted separately in Workunit State Store or Watermark store.
+    // Offline duration of 0 means any offline instance should be purged (Note: there aren't any online instances
+    // when this code runs, this is during startup before any containers are allocated).
+    LOGGER.info("Purging offline helix instances before allocating containers for helixClusterName={}, connectionString={}",
+        helixManager.getClusterName(), helixManager.getMetadataStoreConnectionString());
+    long offlineDuration = 0;
+    this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(), offlineDuration);

Review Comment:
+1 to making this behavior configurable, so that if something goes wrong on the Helix side and causes starvation, we can quickly mitigate by disabling it. It would also be good to add instrumentation here to measure how long it usually takes to clear the offline instances.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
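
For illustration only, the two suggestions in the review comment (a config switch around the purge, plus timing of the call) might look roughly like the sketch below. It is not code from the PR. The config keys, the class name, and the `OfflineInstancePurger` interface are hypothetical; the interface only mirrors the `purgeOfflineInstances(clusterName, offlineDuration)` call shown in the diff so that the example compiles without Helix on the classpath.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

public class PurgeOfflineInstancesSketch {

  /** Hypothetical stand-in for the single HelixAdmin call that appears in the diff. */
  interface OfflineInstancePurger {
    void purgeOfflineInstances(String clusterName, long offlineDuration);
  }

  // Hypothetical config keys, invented for this sketch; they are not Gobblin configuration names.
  static final String PURGE_ENABLED_KEY = "example.yarn.purgeOfflineHelixInstances.enabled";
  static final String PURGE_OFFLINE_DURATION_KEY = "example.yarn.purgeOfflineHelixInstances.offlineDuration";

  /**
   * Runs the purge only when the switch is on (default: on) and times the call.
   *
   * @return elapsed wall-clock milliseconds of the purge call, or -1 if the purge was skipped
   */
  static long purgeIfEnabled(Properties config, OfflineInstancePurger purger, String clusterName) {
    boolean enabled = Boolean.parseBoolean(config.getProperty(PURGE_ENABLED_KEY, "true"));
    if (!enabled) {
      // Mitigation path: operators can switch the purge off without a code change.
      return -1L;
    }
    // Default of 0 matches the value hard-coded in the diff.
    long offlineDuration = Long.parseLong(config.getProperty(PURGE_OFFLINE_DURATION_KEY, "0"));

    long startNanos = System.nanoTime();
    purger.purgeOfflineInstances(clusterName, offlineDuration);
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
  }

  public static void main(String[] args) {
    Properties config = new Properties();
    // A fake purger that only logs; a real caller would delegate to helixAdmin here.
    OfflineInstancePurger fakePurger = (cluster, duration) ->
        System.out.println("purging cluster=" + cluster + " offlineDuration=" + duration);

    long elapsedMs = purgeIfEnabled(config, fakePurger, "exampleCluster");
    System.out.println("purge took " + elapsedMs + " ms");
  }
}
```

In YarnService the elapsed time would more likely be reported through whatever metrics or event facility the service already uses rather than returned to the caller; the return value here just keeps the sketch easy to exercise.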