[ 
https://issues.apache.org/jira/browse/GOBBLIN-1823?focusedWorklogId=861444&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-861444
 ]

ASF GitHub Bot logged work on GOBBLIN-1823:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/May/23 23:50
            Start Date: 10/May/23 23:50
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 commented on code in PR #3692:
URL: https://github.com/apache/gobblin/pull/3692#discussion_r1190482948


##########
gobblin-yarn/src/main/java/org/apache/gobblin/yarn/YarnService.java:
##########
@@ -501,12 +514,30 @@ public synchronized boolean 
requestTargetNumberOfContainers(YarnContainerRequest
       }
     }
 
+    //We go through all the containers we have now and check whether the 
assigned participant is still alive, if not, we should put them in idle 
container Map
+    //And we will release the container if the assigned participant still 
offline after a given time
+
+    List<Container> containersToRelease = new ArrayList<>();
+    for (Map.Entry<ContainerId, ContainerInfo> entry : 
this.containerMap.entrySet()) {
+      ContainerInfo containerInfo = entry.getValue();
+      if (!HelixUtils.isInstanceLive(helixManager, 
containerInfo.getHelixParticipantId())) {
+        containerIdleSince.putIfAbsent(entry.getKey(), 
System.currentTimeMillis());
+        if (System.currentTimeMillis() - containerIdleSince.get(entry.getKey())
+            >= 
TimeUnit.MINUTES.toMillis(YarnAutoScalingManager.DEFAULT_MAX_CONTAINER_IDLE_TIME_BEFORE_SCALING_DOWN_MINUTES))
 {
+          LOGGER.info("Releasing Container {} because the assigned participant 
{} has been in-active for more than {} minutes",
+              entry.getKey(), containerInfo.getHelixParticipantId(), 
YarnAutoScalingManager.DEFAULT_MAX_CONTAINER_IDLE_TIME_BEFORE_SCALING_DOWN_MINUTES);
+          containersToRelease.add(containerInfo.getContainer());
+        }
+      } else {
+        containerIdleSince.remove(entry.getKey());
+      }
+    }
+
     // If the total desired is lower than the currently allocated amount then 
release free containers.
     // This is based on the currently allocated amount since containers may 
still be in the process of being allocated
     // and assigned work. Resizing based on numRequestedContainers at this 
point may release a container right before
     // or soon after it is assigned work.
-    if (numTargetContainers < totalAllocatedContainers) {
-      List<Container> containersToRelease = new ArrayList<>();
+    if (containersToRelease.isEmpty() && numTargetContainers < 
totalAllocatedContainers) {

Review Comment:
   Because at this point we still hold reference to those bad containers, and 
we might end up with releasing those containers again in this block. We can 
consider changing the algorism to have a hash set of containerIDtoRelease and 
collect the ContainerIdToRelease first and at the end of the call calculate the 
containerToRelease. 





Issue Time Tracking
-------------------

    Worklog Id:     (was: 861444)
    Time Spent: 40m  (was: 0.5h)

> Improving Container Calculation and Allocation Methodology
> ----------------------------------------------------------
>
>                 Key: GOBBLIN-1823
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1823
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: Zihan Li
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> When Yarn allocates "ghost containers" without calling the 
> onContainerAllocated() method and when the container is eventually released, 
> onContainersCompleted() is called, container numbers mismatches can occur. 
> In the onContainerAllocated() method, we add the container to the 
> containerMap using the container ID as the key, and increase the count for 
> the specific tag.
> In the onContainersCompleted() method, we remove the container from the 
> containerMap and decrease the count. However, in some cases, we find that the 
> containerMap does not contain the ID, and we ignore this while still 
> decreasing the number of the allocated tag. We do this because sometimes 
> onContainersCompleted() is called before onContainerAllocated() for the same 
> container.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to