bdoyle0182 opened a new pull request, #5326: URL: https://github.com/apache/openwhisk/pull/5326
## Description

The corresponding issue has a more detailed series of events that lead to this case, but here's the TL;DR:

1. The container has no activations, so it transitions to the paused state.
2. The container times out while paused and is ready to be deleted.
3. Before a paused container is deleted, the container count must be checked to determine whether it should be deleted.
4. The etcd request fails and the failed future is piped to the FSM. The paused state doesn't handle this message type, so it stashes the message until the next state transition, and the container proxy sits in this corrupted state until a new activation is received.
5. A new activation is received, an unpause of the container is attempted, and the FSM transitions back to Running while it waits for the unpause future to complete.
6. When the FSM transitions, it unstashes the failed future message from step 4, which is handled in the Running state and destroys the container.
7. The container is destroyed, but the unpause future from step 5 succeeds. Its side effect rewrites the container key to etcd, and this key is now orphaned forever since the container was actually destroyed.
8. The scheduler sees the container through the watch endpoint it's listening to, so the queue for the action is stuck thinking a container exists that actually doesn't. If activations are infrequent enough that the scheduling decision maker only needs one container, this action can never run again unless the system is restarted.

I've reproduced this case in my test environment many times, and with this change everything is handled gracefully. I also added a unit test that simulates this case to verify the container proxy is gracefully torn down after a failed request to etcd.
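To make the ordering in the steps above easier to follow, here is a minimal toy model of the race in Python. It is not OpenWhisk code: the class, message names, and key name are all hypothetical, and the `handle_failure_when_paused` flag only assumes the general shape of the fix (handling the failed etcd check while paused instead of stashing it).

```python
from dataclasses import dataclass, field

KEY = "container-key"  # hypothetical etcd key name, for illustration only


@dataclass
class ToyProxy:
    """Toy stand-in for the container proxy FSM (assumed shape, not the real one)."""
    handle_failure_when_paused: bool
    state: str = "Paused"
    stash: list = field(default_factory=list)
    etcd_keys: set = field(default_factory=set)
    destroyed: bool = False
    unpause_pending: bool = False

    def receive(self, msg: str) -> None:
        if msg == "EtcdCheckFailed":
            if self.state == "Running" or (
                self.state == "Paused" and self.handle_failure_when_paused
            ):
                self._destroy()            # step 6, or the assumed fix
            elif self.state == "Paused":
                self.stash.append(msg)     # step 4: unhandled, so stashed
        elif msg == "Activation" and self.state == "Paused":
            self.unpause_pending = True    # step 5: unpause future in flight
            self._goto("Running")

    def _goto(self, new_state: str) -> None:
        # A state transition replays every stashed message in the new state.
        self.state = new_state
        pending, self.stash = self.stash, []
        for msg in pending:
            self.receive(msg)

    def _destroy(self) -> None:
        self.destroyed = True
        self.etcd_keys.discard(KEY)
        self.state = "Removed"

    def complete_unpause(self) -> None:
        # Step 7: the unpause future succeeds and rewrites the key,
        # regardless of what happened to the container in the meantime.
        if self.unpause_pending:
            self.unpause_pending = False
            self.etcd_keys.add(KEY)


def run_scenario(handle_failure_when_paused: bool) -> ToyProxy:
    proxy = ToyProxy(handle_failure_when_paused)
    proxy.receive("EtcdCheckFailed")  # failed container-count check
    proxy.receive("Activation")       # new activation arrives later
    proxy.complete_unpause()
    return proxy


def is_orphaned(proxy: ToyProxy) -> bool:
    # Orphaned means the key is in etcd but the container is gone.
    return proxy.destroyed and KEY in proxy.etcd_keys
```

Calling `run_scenario(False)` walks through steps 4 to 7 and leaves a destroyed container with its key still present, while `run_scenario(True)` tears the proxy down as soon as the check fails, so no unpause is ever started and no key is rewritten.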
## Related issue and scope
- [X] I opened an issue to propose and discuss this change (#5325)

## My changes affect the following components
- [ ] API
- [ ] Controller
- [ ] Message Bus (e.g., Kafka)
- [ ] Loadbalancer
- [ ] Scheduler
- [X] Invoker
- [ ] Intrinsic actions (e.g., sequences, conductors)
- [ ] Data stores (e.g., CouchDB)
- [ ] Tests
- [ ] Deployment
- [ ] CLI
- [ ] General tooling
- [ ] Documentation

## Types of changes:
- [X] Bug fix (generally a non-breaking change which closes an issue).
- [ ] Enhancement or new feature (adds new functionality).
- [ ] Breaking change (a bug fix or enhancement which changes existing behavior).

## Checklist:
- [X] I signed an [Apache CLA](https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md).
- [X] I reviewed the [style guides](https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md#coding-standards) and followed the recommendations (Travis CI will check :).
- [X] I added tests to cover my changes.
- [ ] My changes require further changes to the documentation.
- [ ] I updated the documentation where necessary.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
