[
https://issues.apache.org/jira/browse/GOBBLIN-1704?focusedWorklogId=809335&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-809335
]
ASF GitHub Bot logged work on GOBBLIN-1704:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 15/Sep/22 22:49
Start Date: 15/Sep/22 22:49
Worklog Time Spent: 10m
Work Description: hanghangliu commented on code in PR #3561:
URL: https://github.com/apache/gobblin/pull/3561#discussion_r972476792
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
LOGGER.info("ApplicationMaster registration response: " + response);
this.maxResourceCapacity =
Optional.of(response.getMaximumResourceCapability());
+ // All previous helix instances should be purged on startup. Gobblin task
runners are stateless from helix
+ // perspective because all important state is persisted separately in
Workunit State Store or Watermark store.
+ // Offline duration of 0 means any offline instance should be purged
(Note: there aren't any online instances
+ // when this code runs, this is during startup before any containers are
allocated).
+ LOGGER.info("Purging offline helix instances before allocating containers
for helixClusterName={}, connectionString={}", helixManager.getClusterName(),
helixManager.getMetadataStoreConnectionString());
+ long offlineDuration = 0;
+ this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(),
offlineDuration);
Review Comment:
We should also add a config here to make this action configurable
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -339,6 +342,15 @@ protected void startUp() throws Exception {
LOGGER.info("ApplicationMaster registration response: " + response);
this.maxResourceCapacity =
Optional.of(response.getMaximumResourceCapability());
+ // All previous helix instances should be purged on startup. Gobblin task
runners are stateless from helix
+ // perspective because all important state is persisted separately in
Workunit State Store or Watermark store.
+ // Offline duration of 0 means any offline instance should be purged
(Note: there aren't any online instances
+ // when this code runs, this is during startup before any containers are
allocated).
+ LOGGER.info("Purging offline helix instances before allocating containers
for helixClusterName={}, connectionString={}", helixManager.getClusterName(),
helixManager.getMetadataStoreConnectionString());
+ long offlineDuration = 0;
+ this.helixAdmin.purgeOfflineInstances(this.helixManager.getClusterName(),
offlineDuration);
Review Comment:
Agreed that we need to avoid the race condition, according to
[ZKHelixAdmin.purgeOfflineInstances()](https://github.com/apache/helix/blob/ee61ff434ee37d4ab8456c739b2229f99d9e0e72/helix-core/src/main/java/org/apache/helix/manager/zk/ZKHelixAdmin.java#L270).
Please do add a GTE or an alert here if this takes too long
Issue Time Tracking
-------------------
Worklog Id: (was: 809335)
Time Spent: 1.5h (was: 1h 20m)
> Purge offline helix instances during startup
> --------------------------------------------
>
> Key: GOBBLIN-1704
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1704
> Project: Apache Gobblin
> Issue Type: New Feature
> Reporter: Matthew Ho
> Priority: Major
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)