[
https://issues.apache.org/jira/browse/GOBBLIN-1704?focusedWorklogId=809251&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-809251
]
ASF GitHub Bot logged work on GOBBLIN-1704:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 15/Sep/22 18:52
Start Date: 15/Sep/22 18:52
Worklog Time Spent: 10m
Work Description: homatthew commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r972323434
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
LOGGER.info("ApplicationMaster registration response: " + response);
this.maxResourceCapacity =
Optional.of(response.getMaximumResourceCapability());
+ // All previous helix instances should be purged on startup. Gobblin task
runners are stateless from helix
+ // perspective because all important state is persisted separately in
Workunit State Store or Watermark store.
+ // Offline duration of 0 means any offline instance should be purged
(Note: there aren't any online instances
+ // when this code runs, this is during startup before any containers are
allocated).
+ LOGGER.info("Purging offline helix instances before allocating containers
for helixClusterName={}, connectionString={}", helixManager.getClusterName(),
helixManager.getMetadataStoreConnectionString());
+ long offlineDuration = 0;
+ this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(),
offlineDuration);
Review Comment:
Well kind of. As mentioned
> But adding a timeout adds a potential bug / risk where we timeout and then
helix starts purging instances while we are allocating new instances (bad bug
with nondeterministic behavior)
We put ourselves at risk of having a live instances without an INSTANCE
config. That would essentially bork the live instance and I don't know that the
system would auto recover. Although I do think we should send a GTE if this
call is taking too long
Issue Time Tracking
-------------------
Worklog Id: (was: 809251)
Time Spent: 1h 10m (was: 1h)
> Purge offline helix instances during startup
> --------------------------------------------
>
> Key: GOBBLIN-1704
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1704
> Project: Apache Gobblin
> Issue Type: New Feature
> Reporter: Matthew Ho
> Priority: Major
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)