bdoyle0182 commented on issue #5286: URL: https://github.com/apache/openwhisk/issues/5286#issuecomment-1189490347
My local etcd lease timeout is 1. I'm updating it to 10 today, as recommended by your current downstream.

Regarding the queue manager retrying 13 times with exponential backoff to get the queue lease from etcd: the issue is not that it fails to get the lease. It's that the lease no longer exists, so there is no longer a queue for this action.

Order of events:

1. A network blip of a couple of seconds with etcd results in leases timing out.
2. Etcd connectivity recovers.
3. However, the lease for the queue has already timed out by this point, so the queue is removed.
4. The controller is supposed to get notified when a `WatchEndpointRemoved` occurs for the respective lease, so that it knows it needs to send a new create-queue request to a scheduler endpoint. However, for whatever reason this doesn't occur, and the controller thinks the queue is still on the old scheduler host.
5. The controller continues to send activations to this scheduler, which no longer has a queue for this action. The scheduler falls back to remote resolution, so it tries to look up the lease in etcd to find the correct scheduler to forward the activation to.
6. Since there is no lease available for the queue, it doesn't know what to do and fails the activation with `The activation has not been processed`.
7. The controller keeps sending activations to this scheduler forever, hitting this case and failing all activations for this action unless the hosts are restarted.

Hopefully increasing the lease timeout helps, so that these short network issues do not result in losing leases. However, that doesn't fix the issue of very specific edge cases not being able to recover gracefully when a lease is inevitably lost and needs to be recreated. At least for this specific case, I have two suggestions:

1. Could the controller cache the queue lease and periodically check that it still exists, creating a new queue if it doesn't? That way you won't recover immediately, but you can at least recover after some time without a restart. Though I don't know if there's any case there where you could end up with duplicate queues for an action.
2. To recover immediately: when the lease no longer exists for the action queue, instead of failing the activation, can we recreate the queue locally in the scheduler the activation was sent to, so that it recovers rather than failing this and all subsequent activations? Or is this also a dangerous suggestion?
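
To make the failure sequence and the two suggestions easier to discuss, here is a minimal toy model in Python. It is only a sketch under stated assumptions: everything is in memory, and every name in it (`FakeEtcd`, `Scheduler`, `Controller`, `lose_lease`, `check_leases`, `demo`) is hypothetical and does not correspond to the real OpenWhisk or etcd APIs. It assumes the controller never receives the `WatchEndpointRemoved` event, as described in step 4.

```python
# Toy in-memory model of the scenario. All names are hypothetical and are
# NOT the real OpenWhisk or etcd APIs.

class FakeEtcd:
    """Stands in for etcd: maps an action to the scheduler holding its queue lease."""
    def __init__(self):
        self.queue_leases = {}

    def lookup(self, action):
        return self.queue_leases.get(action)


class Scheduler:
    def __init__(self, name, etcd, recreate_on_missing_lease=False):
        self.name = name
        self.etcd = etcd
        self.queues = {}  # action -> list of pending activations
        self.recreate_on_missing_lease = recreate_on_missing_lease

    def create_queue(self, action):
        self.queues[action] = []
        self.etcd.queue_leases[action] = self.name

    def lose_lease(self, action):
        """Simulates steps 1-3: the lease times out and the queue is removed."""
        self.queues.pop(action, None)
        self.etcd.queue_leases.pop(action, None)

    def handle(self, action, activation):
        if action in self.queues:
            self.queues[action].append(activation)
            return "queued"
        owner = self.etcd.lookup(action)  # remote resolution (step 5)
        if owner is not None:
            return f"forward to {owner}"
        if self.recreate_on_missing_lease:  # suggestion 2: recreate locally
            self.create_queue(action)
            self.queues[action].append(activation)
            return "queued"
        return "The activation has not been processed"  # step 6


class Controller:
    def __init__(self, etcd, schedulers):
        self.etcd = etcd
        self.schedulers = schedulers  # scheduler name -> Scheduler
        self.queue_location = {}      # action -> scheduler name (cached)

    def invoke(self, action, activation):
        name = self.queue_location.get(action)
        if name is None:
            # No known queue: ask the first scheduler to create one.
            name = next(iter(self.schedulers))
            self.schedulers[name].create_queue(action)
            self.queue_location[action] = name
        return self.schedulers[name].handle(action, activation)

    def check_leases(self):
        """Suggestion 1: run periodically; forget locations whose lease is gone."""
        for action in list(self.queue_location):
            if self.etcd.lookup(action) is None:
                del self.queue_location[action]


def demo(recreate_locally, periodic_check):
    etcd = FakeEtcd()
    scheduler = Scheduler("scheduler0", etcd,
                          recreate_on_missing_lease=recreate_locally)
    controller = Controller(etcd, {"scheduler0": scheduler})
    results = [controller.invoke("myAction", "a1")]
    # The lease is lost, and the controller is never notified (step 4).
    scheduler.lose_lease("myAction")
    results.append(controller.invoke("myAction", "a2"))
    if periodic_check:
        controller.check_leases()
    results.append(controller.invoke("myAction", "a3"))
    return results
```

In this toy, `demo(False, False)` reproduces the current behavior: every activation after the lease loss fails. `demo(False, True)` loses the one activation sent before the periodic check runs and then recovers, and `demo(True, False)` recovers on the first activation after the loss. The model has a single scheduler, so it says nothing about whether either approach could produce duplicate queues in a real cluster.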
