ZihanLi58 opened a new pull request, #3692: URL: https://github.com/apache/gobblin/pull/3692
Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

### JIRA
- [ ] My PR addresses the following [Gobblin JIRA](https://issues.apache.org/jira/browse/GOBBLIN/) issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
    - https://issues.apache.org/jira/browse/GOBBLIN-1823

### Description
- [ ] Here are some details about my PR, including screenshots (if applicable):

**Problem**: Yarn sometimes allocates "ghost containers" without calling the onContainerAllocated() method. When such a container is eventually released, onContainersCompleted() is still called, and the container counts can become mismatched.

In the onContainerAllocated() method, we add the container to the containerMap using the container ID as the key, and increase the count for the specific tag. In the onContainersCompleted() method, we remove the container from the containerMap and decrease the count. However, in some cases the containerMap does not contain the ID, and we ignore this while still decreasing the count for the allocated tag. We do this because onContainersCompleted() is sometimes called before onContainerAllocated() for the same container.

**Solution**
1. Add the removedContainerID map to track the containers that have been released before onContainerAllocated() is called.
2. Go through the container map to check whether the assigned Helix instance is alive, and release the container when the instance has not been alive for more than 10 minutes.
3. Add TIME_OUT and COMPLETED as non-retryable partition states and log them to improve debuggability.

### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

Unit tests cover the existing functions; it is hard to add a unit test for the bad Yarn container and Helix disconnection scenarios.
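For reviewers, the bookkeeping behind step 1 of the Solution can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class `ContainerTracker`, its method signatures, and the use of plain strings for container IDs and tags are all hypothetical, and the real callbacks receive Yarn container objects rather than strings.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: names and signatures are illustrative only and are
// not Gobblin's actual classes or Yarn's callback API.
class ContainerTracker {
  // containerId -> tag, mirroring the role of containerMap in the description
  private final Map<String, String> containerMap = new HashMap<>();
  // IDs whose completion was reported before their allocation
  private final Set<String> removedContainerIds = new HashSet<>();
  // tag -> number of containers currently counted as allocated
  private final Map<String, Integer> allocatedCountByTag = new HashMap<>();

  // Returns false when the container was already reported as completed,
  // in which case it is neither stored nor counted.
  synchronized boolean onContainerAllocated(String containerId, String tag) {
    if (removedContainerIds.remove(containerId)) {
      return false;
    }
    containerMap.put(containerId, tag);
    allocatedCountByTag.merge(tag, 1, Integer::sum);
    return true;
  }

  // Decrements the tag count only for containers that were actually counted.
  // Unknown IDs are remembered so a late allocation callback can be ignored.
  synchronized void onContainerCompleted(String containerId) {
    String tag = containerMap.remove(containerId);
    if (tag == null) {
      removedContainerIds.add(containerId);
      return;
    }
    allocatedCountByTag.merge(tag, -1, Integer::sum);
  }

  synchronized int allocatedCount(String tag) {
    return allocatedCountByTag.getOrDefault(tag, 0);
  }
}
```

In this sketch a completion for an unknown ID never touches the per-tag count, so neither a ghost container nor an out-of-order callback pair can push the count below the number of containers actually tracked. One caveat of this simplified version is that IDs of ghost containers stay in `removedContainerIds` indefinitely; a real implementation would likely need to expire them.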
### Commits
- [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    4. Subject does not end with a period
    5. Subject uses the imperative mood ("add", not "adding")
    6. Body wraps at 72 characters
    7. Body explains "what" and "why", not "how"

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
