[
https://issues.apache.org/jira/browse/GOBBLIN-1823?focusedWorklogId=860826&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-860826
]
ASF GitHub Bot logged work on GOBBLIN-1823:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 05/May/23 21:21
Start Date: 05/May/23 21:21
Worklog Time Spent: 10m
Work Description: ZihanLi58 opened a new pull request, #3692:
URL: https://github.com/apache/gobblin/pull/3692
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I
have checked off all the steps below!
### JIRA
- [ ] My PR addresses the following [Gobblin
JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references
them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-1823
### Description
- [ ] Here are some details about my PR, including screenshots (if
applicable):
**Problem**: Yarn sometimes allocates "ghost containers" without ever calling
the onContainerAllocated() method. When such a container is eventually
released, onContainersCompleted() is still called, so the tracked container
counts can drift out of sync.
In the onContainerAllocated() method, we add the container to the
containerMap using the container ID as the key, and increase the count for the
specific tag.
In the onContainersCompleted() method, we remove the container from the
containerMap and decrease the count. However, in some cases the containerMap
does not contain the ID; we ignore this and still decrease the count for the
allocated tag. We do this because onContainersCompleted() is sometimes called
before onContainerAllocated() for the same container.
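The drift can be reproduced with a simplified model of the bookkeeping described above. This is only an illustrative sketch, not Gobblin's actual code: the class `NaiveContainerTracker`, its method signatures, and the use of plain strings for container IDs and tags are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the bookkeeping; not Gobblin's YarnService. */
class NaiveContainerTracker {
    // container ID -> tag of the container
    private final Map<String, String> containerMap = new HashMap<>();
    // tag -> number of containers believed to be allocated
    private final Map<String, Integer> allocatedCountByTag = new HashMap<>();

    void onContainerAllocated(String containerId, String tag) {
        containerMap.put(containerId, tag);
        allocatedCountByTag.merge(tag, 1, Integer::sum);
    }

    // Assumption: the tag is passed in, since an unknown ID has no map entry to look it up from.
    void onContainerCompleted(String containerId, String tag) {
        // A missing map entry is ignored, but the count is decremented anyway.
        containerMap.remove(containerId);
        allocatedCountByTag.merge(tag, -1, Integer::sum);
    }

    int allocatedCount(String tag) {
        return allocatedCountByTag.getOrDefault(tag, 0);
    }

    int trackedContainers() {
        return containerMap.size();
    }
}
```

In this model, allocating one real container and then completing a ghost container under the same tag leaves the count at 0 while the map still tracks one container.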
**Solution**
1. Add the removedContainerID map to track the containers that are released
before onContainerAllocated() is called
2. Go through the container map to check whether each assigned Helix
instance is alive, and release the container when its instance has not been
alive for more than 10 minutes
3. Add TIME_OUT and COMPLETED as un-retryable partition states and log them
to improve debuggability.
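Steps 1 and 2 above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in this PR: `ContainerTracker`, `findContainersToRelease`, the `offlineSince` map, and the string IDs are hypothetical, and the sketch keeps a single count that changes only when the map changes, which is one possible way to keep the two in sync.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of steps 1 and 2; names and types are illustrative only. */
class ContainerTracker {
    static final Duration OFFLINE_TIMEOUT = Duration.ofMinutes(10);

    // container ID -> assigned Helix instance name
    private final Map<String, String> containerMap = new HashMap<>();
    // IDs of containers that completed before any allocation callback arrived
    private final Set<String> removedContainerIds = new HashSet<>();
    // container ID -> time its instance was first observed offline
    private final Map<String, Instant> offlineSince = new HashMap<>();
    private int allocatedCount = 0;

    void onContainerAllocated(String containerId, String helixInstance) {
        // Step 1: the container already completed, so do not start tracking it.
        if (removedContainerIds.remove(containerId)) {
            return;
        }
        containerMap.put(containerId, helixInstance);
        allocatedCount++;
    }

    void onContainerCompleted(String containerId) {
        if (containerMap.remove(containerId) == null) {
            // Unknown ID: remember it instead of touching the count.
            removedContainerIds.add(containerId);
            return;
        }
        offlineSince.remove(containerId);
        allocatedCount--;
    }

    /** Step 2: IDs of containers whose instance has been offline longer than the timeout. */
    List<String> findContainersToRelease(Set<String> liveInstances, Instant now) {
        List<String> toRelease = new ArrayList<>();
        for (Map.Entry<String, String> entry : containerMap.entrySet()) {
            String id = entry.getKey();
            if (liveInstances.contains(entry.getValue())) {
                offlineSince.remove(id); // instance is alive again; reset the timer
                continue;
            }
            Instant since = offlineSince.computeIfAbsent(id, k -> now);
            if (Duration.between(since, now).compareTo(OFFLINE_TIMEOUT) > 0) {
                toRelease.add(id);
            }
        }
        return toRelease;
    }

    int getAllocatedCount() {
        return allocatedCount;
    }
}
```

Passing `now` and the set of live instances as parameters keeps the sketch easy to test without a running Yarn or Helix cluster. A real implementation would also need to expire entries in `removedContainerIds` for ghost containers whose allocation callback never arrives.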
### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for
this extremely good reason:
Unit test for the existing function. It is hard to add a unit test for the
bad Yarn container and Helix disconnection scenarios.
### Commits
- [ ] My commits all reference JIRA issues in their subject lines, and I
have squashed multiple commits if they address the same issue. In addition, my
commits follow the guidelines from "[How to write a good git commit
message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
Issue Time Tracking
-------------------
Worklog Id: (was: 860826)
Remaining Estimate: 0h
Time Spent: 10m
> Improving Container Calculation and Allocation Methodology
> ----------------------------------------------------------
>
> Key: GOBBLIN-1823
> URL: https://issues.apache.org/jira/browse/GOBBLIN-1823
> Project: Apache Gobblin
> Issue Type: Improvement
> Reporter: Zihan Li
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Yarn sometimes allocates "ghost containers" without ever calling the
> onContainerAllocated() method. When such a container is eventually
> released, onContainersCompleted() is still called, so the tracked container
> counts can drift out of sync.
> In the onContainerAllocated() method, we add the container to the
> containerMap using the container ID as the key, and increase the count for
> the specific tag.
> In the onContainersCompleted() method, we remove the container from the
> containerMap and decrease the count. However, in some cases the containerMap
> does not contain the ID; we ignore this and still decrease the count for the
> allocated tag. We do this because onContainersCompleted() is sometimes called
> before onContainerAllocated() for the same container.