[
https://issues.apache.org/jira/browse/GOBBLIN-1704?focusedWorklogId=808523&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-808523
]
ASF GitHub Bot logged work on GOBBLIN-1704:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 14/Sep/22 05:47
Start Date: 14/Sep/22 05:47
Worklog Time Spent: 10m
Work Description: homatthew commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r970335122
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
LOGGER.info("ApplicationMaster registration response: " + response);
this.maxResourceCapacity = Optional.of(response.getMaximumResourceCapability());
+ // All previous helix instances should be purged on startup. Gobblin task runners are stateless from helix
+ // perspective because all important state is persisted separately in Workunit State Store or Watermark store.
+ // Offline duration of 0 means any offline instance should be purged (Note: there aren't any online instances
+ // when this code runs, this is during startup before any containers are allocated).
+ LOGGER.info("Purging offline helix instances before allocating containers for helixClusterName={}, connectionString={}", helixManager.getClusterName(), helixManager.getMetadataStoreConnectionString());
+ long offlineDuration = 0;
+ this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(), offlineDuration);
Review Comment:
Not sure if we should have some sort of timeout in case the API takes a long
time to respond. Currently this is a blocking call, and I wonder whether, if
this call fails, we could block containers from being allocated in the
YarnService (leading to starvation).
But adding a timeout introduces a potential bug / risk: we time out, and then
helix starts purging instances while we are allocating new instances (a bad bug
with nondeterministic behavior).
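To make the tradeoff concrete, here is a minimal sketch of what a bounded wait
around a blocking call could look like, using only java.util.concurrent. It is
not the PR's code: the class BoundedPurge, the method runWithTimeout, and the
Outcome enum are hypothetical names, and the purge itself is stood in for by a
plain Runnable rather than the real helixAdmin call. The comment on the timeout
branch marks the risk described above: giving up on the wait only interrupts
the local thread, so a remote operation may still be in progress.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only; not part of Gobblin or Helix.
public class BoundedPurge {

  /** How the wait on the blocking call ended. */
  public enum Outcome { COMPLETED, TIMED_OUT, FAILED }

  /**
   * Runs blockingCall on a separate daemon thread and waits at most the given
   * timeout for it to finish.
   */
  public static Outcome runWithTimeout(Runnable blockingCall, long timeout, TimeUnit unit)
      throws InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "bounded-purge");
      t.setDaemon(true); // do not keep the JVM alive for a stuck call
      return t;
    });
    Future<?> future = executor.submit(blockingCall);
    try {
      future.get(timeout, unit);
      return Outcome.COMPLETED;
    } catch (TimeoutException e) {
      // cancel(true) only interrupts the local worker thread. If the call has
      // already reached a remote service, that work may still proceed, which
      // is the race with container allocation raised in the comment above.
      future.cancel(true);
      return Outcome.TIMED_OUT;
    } catch (ExecutionException e) {
      // The call itself threw; the caller decides whether startup continues.
      return Outcome.FAILED;
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-ins for the purge: one returns immediately, one blocks too long.
    Outcome fast = runWithTimeout(() -> { }, 1, TimeUnit.SECONDS);
    Outcome slow = runWithTimeout(() -> {
      try {
        Thread.sleep(5_000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, 100, TimeUnit.MILLISECONDS);
    System.out.println(fast + " " + slow);
  }
}
```

A caller that gets TIMED_OUT cannot tell whether the purge will still happen
later, so in this sketch the safe options would be to fail startup or to retry
the purge until it completes, rather than to go on allocating containers.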
Issue Time Tracking
-------------------
Worklog Id: (was: 808523)
Time Spent: 0.5h (was: 20m)
> Purge offline helix instances during startup
> --------------------------------------------
>
> Key: GOBBLIN-1704
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1704
> Project: Apache Gobblin
> Issue Type: New Feature
> Reporter: Matthew Ho
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)