[
https://issues.apache.org/jira/browse/GOBBLIN-1704?focusedWorklogId=809213&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-809213
]
ASF GitHub Bot logged work on GOBBLIN-1704:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 15/Sep/22 17:05
Start Date: 15/Sep/22 17:05
Worklog Time Spent: 10m
Work Description: umustafi commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r972235376
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
LOGGER.info("ApplicationMaster registration response: " + response);
this.maxResourceCapacity =
Optional.of(response.getMaximumResourceCapability());
+ // All previous helix instances should be purged on startup. Gobblin task runners are stateless from helix
+ // perspective because all important state is persisted separately in Workunit State Store or Watermark store.
+ // Offline duration of 0 means any offline instance should be purged (Note: there aren't any online instances
+ // when this code runs, this is during startup before any containers are allocated).
+ LOGGER.info("Purging offline helix instances before allocating containers for helixClusterName={}, connectionString={}", helixManager.getClusterName(), helixManager.getMetadataStoreConnectionString());
+ long offlineDuration = 0;
+ this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(), offlineDuration);
Review Comment:
As you're saying, we should block on this call so we don't start allocating
new instances while purging is happening in a different thread. However, we
don't want the whole `YarnService` to fail initialization because of a Helix
issue. We should block with a timeout. Then, if the purge fails, emit a
**metric** (in addition to a log) indicating that Helix instances failed to be
purged. We should set alerts on this metric so on-call can respond and
investigate the failure, so that Helix instances don't grow unbounded. A log
warning/error alone is easy to ignore and miss until there is a GCN level
issue.
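A minimal sketch of the block-with-timeout idea, using only the JDK. The class `BoundedPurge`, the method `runWithTimeout`, and the `AtomicLong` counter are hypothetical stand-ins (the counter stands in for whatever metrics facility the project actually uses); none of them come from Gobblin or Helix.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration only; not part of Gobblin or Helix.
class BoundedPurge {

  /**
   * Runs purgeTask on a separate thread and blocks the caller for at most
   * timeoutMillis. Returns true if the task finished in time without throwing.
   * On timeout, error, or interruption it increments failureCounter (a
   * stand-in for a real metric), logs, and returns false instead of
   * propagating the failure to the caller.
   */
  static boolean runWithTimeout(Runnable purgeTask, long timeoutMillis, AtomicLong failureCounter) {
    // Daemon thread so a stuck purge cannot keep the JVM alive.
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "helix-purge");
      t.setDaemon(true);
      return t;
    });
    Future<?> future = executor.submit(purgeTask);
    try {
      future.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException | ExecutionException e) {
      future.cancel(true);
      failureCounter.incrementAndGet();
      System.err.println("Failed to purge offline Helix instances: " + e);
      return false;
    } catch (InterruptedException e) {
      future.cancel(true);
      Thread.currentThread().interrupt(); // preserve the interrupt status
      failureCounter.incrementAndGet();
      return false;
    } finally {
      executor.shutdownNow();
    }
  }
}
```

In `startUp()`, the task passed in could be something like `() -> helixAdmin.purgeOfflineInstances(clusterName, 0)`, with the timeout value chosen by the service. Startup then continues whether or not the purge succeeded, and the counter is what an alert would watch.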
Issue Time Tracking
-------------------
Worklog Id: (was: 809213)
Time Spent: 40m (was: 0.5h)
> Purge offline helix instances during startup
> --------------------------------------------
>
> Key: GOBBLIN-1704
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1704
> Project: Apache Gobblin
> Issue Type: New Feature
> Reporter: Matthew Ho
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)