bdoyle0182 opened a new pull request, #5388:
URL: https://github.com/apache/openwhisk/pull/5388

   ## Description
   There is an edge case in the queue manager / memory queue: when the memory
queue transitions from `Idle` to `Removed`, the action can get stuck
indefinitely if it starts receiving activations again. The cause is that on
the `Idle` -> `Removed` transition the MemoryQueue never sends the
`QueueRemoved` message to the QueueManager, and that message is what removes
the actor's entry from the `QueuePool` trie map in the QueueManager. As long as
the entry remains in `QueuePool`, the manager can still forward activations to
the child memory queue FSM. Receiving `QueueRemoved` is also what prompts the
QueueManager to send `QueueRemovedCompleted` back to the MemoryQueue FSM, which
is what actually makes the memory queue stop itself. Since this exchange never
happens on the `Idle` -> `Removed` transition, the `Removed` state depends on
the `StateTimeout` firing to stop the actor, and only _then_ is `QueueRemoved`
sent to the parent so that the entry is removed from `QueuePool`. Note that the
whole etcd sequence that removes the queue keys succeeds and the queue manager
acknowledges it with `WatchEndpointRemoved`, but that does not remove the entry
from `QueuePool`.
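
   The message exchange described above can be sketched as a toy model. This is
not OpenWhisk code: the real components are Akka actors written in Scala, and
the class layout, method names, and the `notify_manager` flag below are
hypothetical stand-ins chosen to mirror the messages named in this description.

```python
# Toy model only. Names mirror the messages in this description,
# not the real Scala classes or their signatures.

class QueueManager:
    def __init__(self):
        self.queue_pool = {}  # action name -> MemoryQueue

    def on_watch_endpoint_removed(self, action):
        # The etcd cleanup is acknowledged, but the pool entry is left untouched.
        pass

    def on_queue_removed(self, action):
        # Only this message removes the entry and completes the handshake.
        queue = self.queue_pool.pop(action, None)
        if queue is not None:
            queue.on_queue_removed_completed()


class MemoryQueue:
    def __init__(self, action, manager):
        self.action = action
        self.manager = manager
        self.state = "Idle"
        self.stopped = False
        manager.queue_pool[action] = self

    def idle_to_removed(self, notify_manager):
        # notify_manager is a hypothetical switch: False models the behavior
        # described above, True models sending QueueRemoved on the transition.
        self.state = "Removed"
        self.manager.on_watch_endpoint_removed(self.action)
        if notify_manager:
            self.manager.on_queue_removed(self.action)

    def on_queue_removed_completed(self):
        self.stopped = True


def run(notify_manager):
    manager = QueueManager()
    queue = MemoryQueue("demo-action", manager)
    queue.idle_to_removed(notify_manager)
    return ("demo-action" in manager.queue_pool, queue.stopped)
```

   In this model `run(False)` leaves the entry in the pool with the queue still
running, while `run(True)` removes the entry and stops the queue.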
   
   Here is where things get catastrophic. Since the FSM actor's entry remains
in the `QueuePool` trie map and the memory queue never sends `QueueRemoved`,
activations keep being sent to the memory queue IF a new activation arrives
within the window before the `Removed` state times out (five seconds in the
default configuration). Akka FSM state timeouts work such that the timeout
message is only delivered IF no other message is received in the interval; the
timer is reset every time a new message arrives. The `Removed` state handles an
incoming activation by forwarding it back to the queue manager, which just
sends it back to the memory queue, creating a cycle that continues until the
activation times out. If the action then goes without new activations (a
multi-minute gap is enough), it self-heals: once the looping activation has
timed out and nothing else arrives, the state timeout fires five seconds later.
If such a gap never occurs, the action stays stuck and executes no activations
until the service is restarted. This makes the bug particularly hard to track
down, because sometimes it self-heals and sometimes it remains stuck forever.
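
   To make the timer behavior concrete, here is a minimal simulation of an
inactivity timeout that restarts on every message. It is a simplified model of
the semantics described above, not Akka itself; the function name, the
`horizon` parameter, and the use of whole seconds are assumptions made for
illustration.

```python
def state_timeout_fires_at(message_times, timeout, horizon):
    """Return when an inactivity timeout fires, or None if it has not fired
    by `horizon`. The timer starts at t=0 and restarts on every message."""
    deadline = timeout
    for t in sorted(message_times):
        if t >= deadline:
            break  # the timer already fired before this message arrived
        deadline = t + timeout  # each message pushes the deadline out again
    return deadline if deadline <= horizon else None


# One message every 4 seconds with a 5 second timeout: the timer never fires
# within the first minute.
steady_traffic = state_timeout_fires_at(range(4, 60, 4), timeout=5, horizon=60)

# Traffic stops after t=3: the timer fires 5 seconds after the last message.
traffic_stops = state_timeout_fires_at([1, 2, 3], timeout=5, horizon=60)
```

   Here `steady_traffic` is `None` and `traffic_stops` is `8`, which matches
the two outcomes described: stuck while messages keep arriving inside the
window, self-healing once they stop.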
   
   The `Flushing` state already accounts for the StateTimeout timer being reset
on each new message, so I've added the same safeguard to the `Removed` state to
guarantee self-recovery and ensure the queue shuts down correctly after the
stop grace time. However, that is only a safeguard. I think there is still a
remaining issue: `QueueRemoved` needs to be sent to the `QueueManager`
somewhere on the transition from `Idle` to `Removed`, not only once the
`Removed` state times out; otherwise the `Idle` -> `Removed` transition will
always rely on the StateTimeout for the FSM to stop itself.
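
   As a rough illustration of why such a safeguard guarantees recovery, the
sketch below compares the two timer policies. It assumes the safeguard behaves
like a timer started once on entering the state and not restarted by incoming
messages; that reading, the function name, and the `fixed_timer` flag are
assumptions for illustration and are not taken from the diff.

```python
def removed_state_stop_time(message_times, grace, fixed_timer):
    """Toy model: time at which the Removed state stops itself."""
    if fixed_timer:
        # Assumed safeguard: started once at state entry, never restarted.
        return grace
    # Inactivity timeout: every message restarts the countdown.
    deadline = grace
    for t in sorted(message_times):
        if t >= deadline:
            break
        deadline = t + grace
    return deadline


# Ten minutes of activations arriving every 4 seconds, 5 second grace time.
activations = list(range(4, 600, 4))
without_safeguard = removed_state_stop_time(activations, 5, fixed_timer=False)
with_safeguard = removed_state_stop_time(activations, 5, fixed_timer=True)
```

   In this model the queue stops at `601` seconds without the safeguard (five
seconds after the last activation) and at `5` seconds with it, regardless of
incoming traffic.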
   
   ## Related issue and scope
   - [ ] I opened an issue to propose and discuss this change (#????)
   
   ## My changes affect the following components
   - [ ] API
   - [ ] Controller
   - [ ] Message Bus (e.g., Kafka)
   - [ ] Loadbalancer
   - [X] Scheduler
   - [ ] Invoker
   - [ ] Intrinsic actions (e.g., sequences, conductors)
   - [ ] Data stores (e.g., CouchDB)
   - [ ] Tests
   - [ ] Deployment
   - [ ] CLI
   - [ ] General tooling
   - [ ] Documentation
   
   ## Types of changes
   - [X] Bug fix (generally a non-breaking change which closes an issue).
   - [ ] Enhancement or new feature (adds new functionality).
   - [ ] Breaking change (a bug fix or enhancement which changes existing 
behavior).
   
   ## Checklist:
   - [X] I signed an [Apache 
CLA](https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md).
   - [X] I reviewed the [style 
guides](https://github.com/apache/openwhisk/blob/master/CONTRIBUTING.md#coding-standards)
 and followed the recommendations (Travis CI will check :).
   - [ ] I added tests to cover my changes.
   - [ ] My changes require further changes to the documentation.
   - [ ] I updated the documentation where necessary.
   
   

