bdoyle0182 opened a new pull request, #5326:
URL: https://github.com/apache/openwhisk/pull/5326

   ## Description
   The corresponding issue (#5325) describes the full sequence of events that 
triggers this case. In short:
   
   1. The container has no activations, so it transitions to the paused state.
   2. The paused container times out and becomes eligible for deletion.
   3. Before deleting a paused container, the proxy must check the container 
count to decide whether it should delete.
   4. The etcd request fails and the failed future is piped to the FSM. The 
paused state does not handle this message type, so the message is stashed until 
the next state transition, and the container proxy sits in this corrupted state 
until a new activation arrives.
   5. A new activation arrives, the proxy attempts to unpause the container, 
and the FSM transitions back to Running while it waits for the unpause future 
to complete.
   6. On that transition, the FSM unstashes the failed-future message from 
step 4, which the Running state handles by destroying the container.
   7. The container is destroyed, but the unpause future from step 5 still 
succeeds. As a side effect it rewrites the container key to etcd, and that key 
is now orphaned forever because the container no longer exists.
   8. The scheduler sees the container through the watch endpoint it listens 
to, so the queue for the action is stuck believing that one container exists 
when it does not. If activations are infrequent enough that the scheduling 
decision maker only ever needs one container, the action can never run until 
the system is restarted.
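
The sequence above can be sketched as a small, dependency-free simulation. This is not the container proxy code: the names (`PausedProxySketch`, `EtcdCheckFailed`, `completeUnpause`, the `handleFailureInPaused` flag) are hypothetical, the etcd store is a plain in-memory set, and treating the failure directly in the paused state is only an assumption about what a graceful teardown could look like.

```scala
import scala.collection.mutable

object PausedProxySketch {
  sealed trait State
  case object Running extends State
  case object Paused extends State
  case object Removed extends State

  sealed trait Msg
  case object EtcdCheckFailed extends Msg // failed future piped to the FSM
  case object NewActivation extends Msg

  final class Proxy(handleFailureInPaused: Boolean) {
    var state: State = Paused
    val etcdKeys = mutable.Set.empty[String] // stand-in for etcd
    private val stash = mutable.Queue.empty[Msg]
    private val key = "container-1"
    private var unpauseInFlight = false

    def receive(msg: Msg): Unit = (state, msg) match {
      // Assumed fix: tear the proxy down while it is still paused.
      case (Paused, EtcdCheckFailed) if handleFailureInPaused => destroy()
      // Buggy path: the paused state does not handle the message, so stash it.
      case (Paused, EtcdCheckFailed) => stash.enqueue(msg)
      case (Paused, NewActivation) =>
        unpauseInFlight = true // the unpause future is now running
        goto(Running)
      case (Running, EtcdCheckFailed) => destroy()
      case _ => () // everything else is ignored in this sketch
    }

    // The unpause future completes regardless of the current FSM state
    // and rewrites the container key as a side effect.
    def completeUnpause(): Unit =
      if (unpauseInFlight) etcdKeys += key

    // A state transition replays every stashed message in the new state.
    private def goto(next: State): Unit = {
      state = next
      val pending = stash.dequeueAll(_ => true)
      pending.foreach(receive)
    }

    private def destroy(): Unit = {
      etcdKeys -= key
      state = Removed
    }
  }

  // Replays steps 4 to 7: failed etcd check, new activation, unpause success.
  def run(handleFailureInPaused: Boolean): (State, Set[String]) = {
    val proxy = new Proxy(handleFailureInPaused)
    proxy.receive(EtcdCheckFailed)
    proxy.receive(NewActivation)
    proxy.completeUnpause()
    (proxy.state, proxy.etcdKeys.toSet)
  }
}
```

With `handleFailureInPaused = false`, the run ends with the proxy in `Removed` while `container-1` is still present in the stand-in etcd set, which is the orphaned key from step 7. With `true`, the proxy is removed before any unpause starts and the set stays empty.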
   
   I've reproduced this case in my test environment many times, and this change 
now handles it gracefully. I also added a unit test that simulates this case 
and verifies that the container proxy is torn down gracefully after a failed 
request to etcd.
   
   ## Related issue and scope
   - [X] I opened an issue to propose and discuss this change (#5325)
   
   ## My changes affect the following components
   - [ ] API
   - [ ] Controller
   - [ ] Message Bus (e.g., Kafka)
   - [ ] Loadbalancer
   - [ ] Scheduler
   - [X] Invoker
   - [ ] Intrinsic actions (e.g., sequences, conductors)
   - [ ] Data stores (e.g., CouchDB)
   - [ ] Tests
   - [ ] Deployment
   - [ ] CLI
   - [ ] General tooling
   - [ ] Documentation
   
   ## Types of changes:
   - [X] Bug fix (generally a non-breaking change which closes an issue).
   - [ ] Enhancement or new feature (adds new functionality).
   - [ ] Breaking change (a bug fix or enhancement which changes existing 
behavior).
   
   ## Checklist:
   - [X] I signed an [Apache 
CLA](https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md).
   - [X] I reviewed the [style 
guides](https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md#coding-standards)
 and followed the recommendations (Travis CI will check :).
   - [X] I added tests to cover my changes.
   - [ ] My changes require further changes to the documentation.
   - [ ] I updated the documentation where necessary.
   
   

