bdoyle0182 commented on issue #5286:
URL: https://github.com/apache/openwhisk/issues/5286#issuecomment-1189490347

   My local etcd lease timeout is 1. I'm updating it to 10 today, as recommended 
by your current downstream.
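
   For reference, a minimal sketch of what that TTL change amounts to at the etcd level, written directly against the jetcd client (assuming the timeout is in seconds; the endpoint, key, and value here are made up for illustration, and OpenWhisk wraps etcd in its own client so the real call site looks different):

   ```scala
   import io.etcd.jetcd.{ByteSequence, Client}
   import io.etcd.jetcd.options.PutOption
   import java.nio.charset.StandardCharsets.UTF_8

   object LeaseTtlSketch extends App {
     // Assumed local endpoint; adjust for the real deployment.
     val client = Client.builder().endpoints("http://127.0.0.1:2379").build()

     // A 10 second TTL gives keep-alives room to resume after a couple-second
     // network blip, instead of the lease expiring almost immediately at TTL = 1.
     val leaseId = client.getLeaseClient.grant(10L).get().getID

     // Any key attached to this lease (e.g. the queue's marker key) disappears
     // together with the lease if keep-alives stop for longer than the TTL.
     val key   = ByteSequence.from("illustrative/queue/key", UTF_8)
     val value = ByteSequence.from("illustrative-scheduler-endpoint", UTF_8)
     client.getKVClient
       .put(key, value, PutOption.newBuilder().withLeaseId(leaseId).build())
       .get()

     client.close()
   }
   ```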
   
   Regarding the queue manager retrying 13 times with exponential backoff to get 
the queue lease from etcd: the issue is not that it can't reach etcd to get the 
lease, it's that the lease no longer exists, so there is no longer a queue for 
this action.
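
   To make that distinction concrete, here is a rough sketch (not the actual QueueManager code; `lookup`, the retry count, and the delays are placeholders) of why backoff only helps when etcd is unreachable, not when the key is simply gone:

   ```scala
   import akka.actor.ActorSystem
   import akka.pattern.after
   import scala.concurrent.{ExecutionContext, Future}
   import scala.concurrent.duration._

   object QueueLookupSketch {
     /**
      * lookup() stands in for the etcd read that resolves the queue's lease:
      *  - a failed Future models "etcd unreachable"     -> retrying with backoff can help
      *  - Future(None) models "etcd answered, key gone" -> the lease expired; no number
      *    of retries will bring the queue back, it has to be recreated instead.
      */
     def retryLookup(lookup: () => Future[Option[String]],
                     retries: Int = 13,
                     delay: FiniteDuration = 100.millis)
                    (implicit system: ActorSystem, ec: ExecutionContext): Future[Option[String]] =
       lookup().recoverWith {
         case _ if retries > 0 =>
           // etcd itself was unreachable: exponential backoff is the right tool here.
           after(delay, system.scheduler)(retryLookup(lookup, retries - 1, delay * 2))
       }
   }
   ```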
   
   Order of events:
   1. A couple-second network blip with etcd results in leases timing out.
   2. Etcd connectivity recovers.
   3. However, the lease for the queue has already expired by this point, so the 
queue is removed.
   4. The controller is supposed to be notified when a `WatchEndpointRemoved` 
occurs for the respective lease, so that it knows it needs to send a new create 
queue request to a scheduler endpoint. However, for whatever reason this doesn't 
occur, and the controller still thinks the queue is on the old scheduler host.
   5. The controller continues to send activations to this scheduler, which no 
longer has a queue for this action, so it falls back to remote resolution and 
tries to look up the lease in etcd to find the correct scheduler to forward the 
activation to (see the sketch after this list).
   6. It then doesn't know what to do and fails the activation with `The 
activation has not been processed`, since there is no lease available for the 
queue.
   7. The controller keeps sending activations to this scheduler forever, 
hitting this case and failing all activations for this action unless the hosts 
are restarted.
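
   Here's that fallback path as I understand it, as a rough sketch (types and names are illustrative, not OpenWhisk's real classes): step 5 is the etcd lookup, and step 6 is the dead end when no lease comes back.

   ```scala
   import scala.concurrent.{ExecutionContext, Future}

   // Illustrative types only.
   sealed trait Resolution
   case object HandleLocally                    extends Resolution
   final case class ForwardTo(endpoint: String) extends Resolution
   case object NoQueueAnywhere                  extends Resolution // "The activation has not been processed"

   object RemoteResolutionSketch {
     def resolve(action: String,
                 hasLocalQueue: String => Boolean,
                 lookupLease: String => Future[Option[String]])
                (implicit ec: ExecutionContext): Future[Resolution] =
       if (hasLocalQueue(action)) Future.successful(HandleLocally)
       else
         // Step 5: no local queue, fall back to remote resolution via etcd.
         lookupLease(action).map {
           case Some(endpoint) => ForwardTo(endpoint) // another scheduler still owns the queue
           case None           => NoQueueAnywhere     // Step 6: lease is gone, activation is failed today
         }
   }
   ```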
   
   Hopefully increasing the lease timeout helps so that these short network 
issues do not result in losing leases. However, that doesn't fix the underlying 
issue that these very specific edge cases can't gracefully recover when a lease 
is inevitably lost and needs to be recreated.
   
   At least for this specific case I have two suggestions:
   1. The controller could cache the queue lease and periodically check that it 
still exists; if it doesn't, send a new create queue request (sketched after this 
list). That way it won't recover immediately, but it can at least recover after 
some time without a restart. Though I don't know whether there's any case where 
that could produce duplicate queues for an action.
   2. To recover immediately, instead of failing the activation when the lease 
no longer exists for the action queue, could we recreate the queue locally in the 
scheduler the activation was sent to? Then it recovers rather than failing that 
activation and all subsequent activations. Or is this also a dangerous suggestion?
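
   A sketch of what I mean by suggestion 1, using a plain Akka scheduler tick (the interval, names, and the two callbacks are placeholders, not the controller's real internals):

   ```scala
   import akka.actor.{ActorSystem, Cancellable}
   import scala.concurrent.{ExecutionContext, Future}
   import scala.concurrent.duration._

   object QueueLeaseRecheckSketch {
     /**
      * Periodically verify that the cached queue lease still exists in etcd and,
      * if it vanished, ask a scheduler to create a new queue for the action.
      * leaseStillExists / sendCreateQueue stand in for the real calls.
      */
     def schedulePeriodicCheck(action: String,
                               leaseStillExists: String => Future[Boolean],
                               sendCreateQueue: String => Future[Unit],
                               interval: FiniteDuration = 30.seconds)
                              (implicit system: ActorSystem, ec: ExecutionContext): Cancellable =
       system.scheduler.scheduleWithFixedDelay(interval, interval) { () =>
         leaseStillExists(action).foreach { exists =>
           if (!exists) sendCreateQueue(action) // fire-and-forget recreate; duplicate-queue concerns apply here
         }
       }
   }
   ```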

