[ 
https://issues.apache.org/jira/browse/GOBBLIN-1823?focusedWorklogId=860826&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-860826
 ]

ASF GitHub Bot logged work on GOBBLIN-1823:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/May/23 21:21
            Start Date: 05/May/23 21:21
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 opened a new pull request, #3692:
URL: https://github.com/apache/gobblin/pull/3692

   Dear Gobblin maintainers,
   
   Please accept this PR. I understand that it will not be reviewed until I 
have checked off all the steps below!
   
   
   ### JIRA
   - [ ] My PR addresses the following [Gobblin 
JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references 
them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
       - https://issues.apache.org/jira/browse/GOBBLIN-1823
   
   
   ### Description
   - [ ] Here are some details about my PR, including screenshots (if 
applicable):
   **Problem**: Yarn sometimes allocates "ghost containers" without calling 
the onContainerAllocated() method. When such a container is eventually 
released, onContainersCompleted() is still called, so the tracked container 
counts can become inconsistent. 
   In the onContainerAllocated() method, we add the container to the 
containerMap using the container ID as the key, and increase the count for the 
specific tag.
   In the onContainersCompleted() method, we remove the container from the 
containerMap and decrease the count. However, in some cases the containerMap 
does not contain the ID; we ignore this and still decrease the allocated count 
for the tag. We do this because onContainersCompleted() is sometimes called 
before onContainerAllocated() for the same container.
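   
   The mismatch can be reproduced with a minimal sketch of the bookkeeping 
described above. The class name, the String IDs and tags, and the count() and 
tracked() helpers are assumptions made for illustration; this is not the actual 
Gobblin code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the bookkeeping described above.
// Single-threaded on purpose; the real callbacks are asynchronous.
class NaiveContainerTracker {
  // containerId -> tag
  private final Map<String, String> containerMap = new HashMap<>();
  // tag -> number of containers counted as allocated
  private final Map<String, Integer> allocatedCount = new HashMap<>();

  void onContainerAllocated(String containerId, String tag) {
    containerMap.put(containerId, tag);
    allocatedCount.merge(tag, 1, Integer::sum);
  }

  void onContainersCompleted(String containerId, String tag) {
    // A missing ID is ignored, but the count is decreased regardless.
    containerMap.remove(containerId);
    allocatedCount.merge(tag, -1, Integer::sum);
  }

  int count(String tag) {
    return allocatedCount.getOrDefault(tag, 0);
  }

  int tracked() {
    return containerMap.size();
  }
}
```

   In this sketch, allocating one container and then completing a container 
that was never allocated leaves one entry in containerMap while the count for 
the tag drops to zero.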
   
   **Solution**
   1. Add the removedContainerID map to track the containers that have been 
released before onContainerAllocated() is called
   2. Go through the container map to check whether the assigned helix 
instance is alive, and release the container when the instance has not been 
alive for more than 10 minutes
   3. Add TIME_OUT and COMPLETED as un-retryable partition states and log the 
state to improve debuggability.
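   
   Steps 1 and 2 could look roughly like the sketch below. It is not the code 
in this PR: the class name, the String IDs, the offlineSinceMs map, 
findContainersToRelease() and trackedCount() are all hypothetical, and 
per-tag counts are left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; single-threaded, whereas the real callbacks are asynchronous.
class ContainerTrackerSketch {
  static final long INSTANCE_DEAD_TIMEOUT_MS = 10 * 60 * 1000L; // 10 minutes

  // containerId -> assigned helix instance name
  private final Map<String, String> containerMap = new HashMap<>();
  // IDs that completed before their allocation callback arrived
  private final Set<String> removedContainerIds = new HashSet<>();
  // helix instance name -> time it was first observed as not alive
  private final Map<String, Long> offlineSinceMs = new HashMap<>();

  void onContainerAllocated(String containerId, String helixInstance) {
    // The container already completed, so the late allocation is dropped.
    if (removedContainerIds.remove(containerId)) {
      return;
    }
    containerMap.put(containerId, helixInstance);
  }

  void onContainersCompleted(String containerId) {
    // Unknown ID: remember it in case onContainerAllocated() arrives later.
    if (containerMap.remove(containerId) == null) {
      removedContainerIds.add(containerId);
    }
  }

  // Returns IDs of containers whose helix instance has not been alive
  // for more than the timeout. The caller is expected to release them.
  List<String> findContainersToRelease(Set<String> liveInstances, long nowMs) {
    List<String> toRelease = new ArrayList<>();
    for (Map.Entry<String, String> entry : containerMap.entrySet()) {
      String instance = entry.getValue();
      if (liveInstances.contains(instance)) {
        offlineSinceMs.remove(instance); // alive again, reset the timer
        continue;
      }
      long since = offlineSinceMs.computeIfAbsent(instance, k -> nowMs);
      if (nowMs - since > INSTANCE_DEAD_TIMEOUT_MS) {
        toRelease.add(entry.getKey());
      }
    }
    return toRelease;
  }

  int trackedCount() {
    return containerMap.size();
  }
}
```

   One limitation of the sketch: IDs of ghost containers that never receive an 
allocation callback stay in removedContainerIds until something cleans them up.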
   
   ### Tests
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   Unit tests for the existing functions. It is hard to add a unit test for 
the bad Yarn container and Helix disconnection scenarios.
   
   ### Commits
   - [ ] My commits all reference JIRA issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
       1. Subject is separated from body by a blank line
       2. Subject is limited to 50 characters
       3. Subject does not end with a period
       4. Subject uses the imperative mood ("add", not "adding")
       5. Body wraps at 72 characters
       6. Body explains "what" and "why", not "how"
   
   




Issue Time Tracking
-------------------

            Worklog Id:     (was: 860826)
    Remaining Estimate: 0h
            Time Spent: 10m

> Improving Container Calculation and Allocation Methodology
> ----------------------------------------------------------
>
>                 Key: GOBBLIN-1823
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1823
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: Zihan Li
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Yarn sometimes allocates "ghost containers" without calling the 
> onContainerAllocated() method. When such a container is eventually released, 
> onContainersCompleted() is still called, so the tracked container counts can 
> become inconsistent. 
> In the onContainerAllocated() method, we add the container to the 
> containerMap using the container ID as the key, and increase the count for 
> the specific tag.
> In the onContainersCompleted() method, we remove the container from the 
> containerMap and decrease the count. However, in some cases the containerMap 
> does not contain the ID; we ignore this and still decrease the allocated 
> count for the tag. We do this because onContainersCompleted() is sometimes 
> called before onContainerAllocated() for the same container.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
