[
https://issues.apache.org/jira/browse/GOBBLIN-1823?focusedWorklogId=861086&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-861086
]
ASF GitHub Bot logged work on GOBBLIN-1823:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 08/May/23 22:29
Start Date: 08/May/23 22:29
Worklog Time Spent: 10m
Work Description: homatthew commented on code in PR #3692:
URL: https://github.com/apache/gobblin/pull/3692#discussion_r1187743610
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -501,12 +514,30 @@ public synchronized boolean
requestTargetNumberOfContainers(YarnContainerRequest
}
}
+ //We go through all the containers we have now and check whether the
assigned participant is still alive, if not, we should put them in idle
container Map
Review Comment:
"Check whether assigned participant is still alive". There is nothing here
to suggest that these instances are actually "assigned" anything.
I think a more accurate comment is instead something like:
> iterate through all containers allocated and check whether the
corresponding helix instance is still LIVE within the helix cluster. A
container that has a bad connection to zookeeper will be dropped from the Helix
cluster if the disconnection is greater than the specified timeout. In these
cases, we want to release the container to get a new container because these
containers won't be assigned tasks by Helix
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -501,12 +514,30 @@ public synchronized boolean
requestTargetNumberOfContainers(YarnContainerRequest
}
}
+ //We go through all the containers we have now and check whether the
assigned participant is still alive, if not, we should put them in idle
container Map
+ //And we will release the container if the assigned participant still
offline after a given time
+
+ List<Container> containersToRelease = new ArrayList<>();
+ for (Map.Entry<ContainerId, ContainerInfo> entry :
this.containerMap.entrySet()) {
+ ContainerInfo containerInfo = entry.getValue();
+ if (!HelixUtils.isInstanceLive(helixManager,
containerInfo.getHelixParticipantId())) {
+ containerIdleSince.putIfAbsent(entry.getKey(),
System.currentTimeMillis());
+ if (System.currentTimeMillis() - containerIdleSince.get(entry.getKey())
+ >=
TimeUnit.MINUTES.toMillis(YarnAutoScalingManager.DEFAULT_MAX_CONTAINER_IDLE_TIME_BEFORE_SCALING_DOWN_MINUTES))
{
+ LOGGER.info("Releasing Container {} because the assigned participant
{} has been in-active for more than {} minutes",
+ entry.getKey(), containerInfo.getHelixParticipantId(),
YarnAutoScalingManager.DEFAULT_MAX_CONTAINER_IDLE_TIME_BEFORE_SCALING_DOWN_MINUTES);
+ containersToRelease.add(containerInfo.getContainer());
+ }
+ } else {
+ containerIdleSince.remove(entry.getKey());
+ }
+ }
+
// If the total desired is lower than the currently allocated amount then
release free containers.
// This is based on the currently allocated amount since containers may
still be in the process of being allocated
// and assigned work. Resizing based on numRequestedContainers at this
point may release a container right before
// or soon after it is assigned work.
- if (numTargetContainers < totalAllocatedContainers) {
- List<Container> containersToRelease = new ArrayList<>();
+ if (containersToRelease.isEmpty() && numTargetContainers <
totalAllocatedContainers) {
Review Comment:
Why do we entirely skip this block if there are already containers to
release? Hopefully the previous block doesn't happen that often, but this if
statement is still a bit strange to read.
To me, this should instead be:
```
if (numTargetContainers < totalAllocatedContainers -
containersToRelease.size())
```
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -473,6 +475,17 @@ public synchronized boolean
requestTargetNumberOfContainers(YarnContainerRequest
return false;
}
+ //Correct the containerMap first as there is cases that
handleContainerCompletion() is called before onContainersAllocated()
+ for (ContainerId removedId :this.removedContainerID.keySet()) {
Review Comment:
nit: whitespace after `:`
##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -473,6 +475,17 @@ public synchronized boolean
requestTargetNumberOfContainers(YarnContainerRequest
return false;
}
+ //Correct the containerMap first as there is cases that
handleContainerCompletion() is called before onContainersAllocated()
+ for (ContainerId removedId :this.removedContainerID.keySet()) {
+ ContainerInfo containerInfo = this.containerMap.remove(removedId);
+ if (containerInfo != null) {
+ String helixTag = containerInfo.getHelixTag();
+ allocatedContainerCountMap.putIfAbsent(helixTag, new AtomicInteger(0));
+ this.allocatedContainerCountMap.get(helixTag).decrementAndGet();
Review Comment:
put if absent 0 and then decrementing means the resulting value would be -1.
That does not seem correct to me
Issue Time Tracking
-------------------
Worklog Id: (was: 861086)
Time Spent: 0.5h (was: 20m)
> Improving Container Calculation and Allocation Methodology
> ----------------------------------------------------------
>
> Key: GOBBLIN-1823
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1823
> Project: Apache Gobblin
> Issue Type: Improvement
> Reporter: Zihan Li
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When Yarn allocates "ghost containers" without calling the
> onContainerAllocated() method and when the container is eventually released,
> onContainersCompleted() is called, container numbers mismatches can occur.
> In the onContainerAllocated() method, we add the container to the
> containerMap using the container ID as the key, and increase the count for
> the specific tag.
> In the onContainersCompleted() method, we remove the container from the
> containerMap and decrease the count. However, in some cases, we find that the
> containerMap does not contain the ID, and we ignore this while still
> decreasing the number of the allocated tag. We do this because sometimes
> onContainersCompleted() is called before onContainerAllocated() for the same
> container.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)