bdoyle0182 opened a new pull request, #5388: URL: https://github.com/apache/openwhisk/pull/5388
## Description

There is an edge case in the queue manager / memory queue: when the memory queue transitions from `Idle` to `Removed`, the action can get stuck indefinitely if it starts receiving activations again. On this transition, the `QueueRemoved` message from the MemoryQueue to the QueueManager is never sent. That message is what removes the actor's entry from the `QueuePool` trie map in the QueueManager. While the entry remains in `QueuePool`, the manager can still forward activations to the child memory queue FSM. Receiving `QueueRemoved` is also what makes the QueueManager send `QueueRemovedCompleted` back to the MemoryQueue FSM, which is what actually makes the memory queue stop itself. Since this series of events never occurs when transitioning from `Idle` to `Removed`, the `Removed` state depends on the `StateTimeout` to stop the actor, and only _then_ is `QueueRemoved` sent to the parent to have the entry removed from `QueuePool`. Note that the whole series of etcd events to remove the queue keys succeeds, and the queue manager acknowledges it with `WatchEndpointRemoved`, but that does not remove the entry from `QueuePool`.

Here is where things get catastrophic. Since the FSM actor entry remains in the `QueuePool` trie map and the memory queue never sends `QueueRemoved`, activations keep being sent to the memory queue if a new activation comes in within the five-second window that the default configuration allows before the `Removed` state times out. Akka FSM state timeouts work such that the timeout message is only sent if no message is received; the timer is reset every time a new message arrives. The `Removed` state handles an incoming activation by forwarding it back to the queue manager, which just sends it back to the memory queue, creating an indefinite cycle until the activation times out.
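
To make the timer behavior concrete, here is a minimal, Akka-free sketch in plain Scala. It is only a model of the description above, not the MemoryQueue code: the object name, `stopTime`, and all parameters are hypothetical, time is a logical clock in seconds, and the grace timer is assumed to be an absolute deadline that incoming messages cannot reset.

```scala
// Akka-free model: logical time in seconds, the state is entered at t = 0.
// All names are hypothetical and not taken from the OpenWhisk code base.
object RemovedStateTimeoutSketch {

  /** Returns the time at which the state stops itself, or None if it is
    * still alive at `horizon`.
    *
    * @param messageTimes  arrival times of activations while in the state
    * @param idleTimeout   state timeout; restarted by every received message
    * @param horizon       end of the simulated time window
    * @param graceDeadline optional absolute deadline (assumption) that
    *                      incoming messages cannot reset
    */
  def stopTime(messageTimes: Seq[Long],
               idleTimeout: Long,
               horizon: Long,
               graceDeadline: Option[Long]): Option[Long] = {
    val hardStop = graceDeadline.getOrElse(Long.MaxValue)

    @annotation.tailrec
    def loop(lastMessage: Long, remaining: List[Long]): Option[Long] = {
      // Earliest moment at which either timer would fire.
      val nextStop = math.min(lastMessage + idleTimeout, hardStop)
      remaining match {
        // The message arrives before any timer fires: the idle timer restarts.
        case t :: rest if t < nextStop => loop(t, rest)
        // A timer fires first, or no messages are left.
        case _ => if (nextStop <= horizon) Some(nextStop) else None
      }
    }

    // Entering the state at t = 0 starts the idle timer.
    loop(0L, messageTimes.toList.filter(_ <= horizon).sorted)
  }
}
```

In this model, activations arriving every 3 seconds with a 5-second idle timeout keep the state alive for the whole 60-second horizon (`stopTime` returns `None`), while adding `graceDeadline = Some(10)` makes it stop at `Some(10)` no matter how often messages arrive.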
Now, if the action has a multi-minute gap between activations, it will self-heal: once the activation times out and no more come in, the `Removed` state times out after five seconds. However, if that gap never happens, the action remains stuck, never executing activations until the service is restarted. This makes the bug particularly hard to track down, because sometimes it self-heals and sometimes it stays stuck forever.

The `StateTimeout` behavior of resetting the timer on each new message is already accounted for in the `Flushing` state, so I've added the same safeguard here to guarantee self-recovery and that the queue will definitely shut down correctly after the stop grace time. However, that is just a safeguard. I think there is still a remaining issue to solve: `QueueRemoved` needs to be sent to the `QueueManager` somewhere on the transition from `Idle` to `Removed`, not only once the `Removed` state times out; otherwise the transition from `Idle` -> `Removed` will always rely on the `StateTimeout` for the FSM to stop itself.

## Related issue and scope
- [ ] I opened an issue to propose and discuss this change (#????)

## My changes affect the following components
- [ ] API
- [ ] Controller
- [ ] Message Bus (e.g., Kafka)
- [ ] Loadbalancer
- [X] Scheduler
- [ ] Invoker
- [ ] Intrinsic actions (e.g., sequences, conductors)
- [ ] Data stores (e.g., CouchDB)
- [ ] Tests
- [ ] Deployment
- [ ] CLI
- [ ] General tooling
- [ ] Documentation

## Types of changes
- [X] Bug fix (generally a non-breaking change which closes an issue).
- [ ] Enhancement or new feature (adds new functionality).
- [ ] Breaking change (a bug fix or enhancement which changes existing behavior).

## Checklist:
- [X] I signed an [Apache CLA](https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md).
- [X] I reviewed the [style guides](https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md#coding-standards) and followed the recommendations (Travis CI will check :).
- [ ] I added tests to cover my changes.
- [ ] My changes require further changes to the documentation.
- [ ] I updated the documentation where necessary.
