ZihanLi58 commented on code in PR #3692:
URL: https://github.com/apache/gobblin/pull/3692#discussion_r1190482948
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -501,12 +514,30 @@ public synchronized boolean
requestTargetNumberOfContainers(YarnContainerRequest
}
}
+ //We go through all the containers we have now and check whether the
assigned participant is still alive, if not, we should put them in idle
container Map
+ //And we will release the container if the assigned participant still
offline after a given time
+
+ List<Container> containersToRelease = new ArrayList<>();
+ for (Map.Entry<ContainerId, ContainerInfo> entry :
this.containerMap.entrySet()) {
+ ContainerInfo containerInfo = entry.getValue();
+ if (!HelixUtils.isInstanceLive(helixManager,
containerInfo.getHelixParticipantId())) {
+ containerIdleSince.putIfAbsent(entry.getKey(),
System.currentTimeMillis());
+ if (System.currentTimeMillis() - containerIdleSince.get(entry.getKey())
+ >=
TimeUnit.MINUTES.toMillis(YarnAutoScalingManager.DEFAULT_MAX_CONTAINER_IDLE_TIME_BEFORE_SCALING_DOWN_MINUTES))
{
+ LOGGER.info("Releasing Container {} because the assigned participant
{} has been in-active for more than {} minutes",
+ entry.getKey(), containerInfo.getHelixParticipantId(),
YarnAutoScalingManager.DEFAULT_MAX_CONTAINER_IDLE_TIME_BEFORE_SCALING_DOWN_MINUTES);
+ containersToRelease.add(containerInfo.getContainer());
+ }
+ } else {
+ containerIdleSince.remove(entry.getKey());
+ }
+ }
+
// If the total desired is lower than the currently allocated amount then
release free containers.
// This is based on the currently allocated amount since containers may
still be in the process of being allocated
// and assigned work. Resizing based on numRequestedContainers at this
point may release a container right before
// or soon after it is assigned work.
- if (numTargetContainers < totalAllocatedContainers) {
- List<Container> containersToRelease = new ArrayList<>();
+ if (containersToRelease.isEmpty() && numTargetContainers <
totalAllocatedContainers) {
Review Comment:
Because at this point we still hold reference to those bad containers, and
we might end up with releasing those containers again in this block. We can
consider changing the algorism to have a hash set of containerIDtoRelease and
collect the ContainerIdToRelease first and at the end of the call calculate the
containerToRelease.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]