bdoyle0182 opened a new issue, #5286:
URL: https://github.com/apache/openwhisk/issues/5286

   I had an action queue get stuck after about two seconds of etcd downtime, 
while other actions recovered gracefully. What appears to happen is that the 
queue endpoint key times out in etcd and no longer exists, but the controller 
is never notified and continues to believe the queue lives on the same 
scheduler endpoint (I believe WatchEndpointRemoved should be sent to the 
controllers in this case, but that doesn't seem to have happened). The 
QueueManager of the scheduler that receives the activation then hits the code 
path below: the queue doesn't exist on that host, so it tries to resolve the 
queue remotely through etcd, but the queue doesn't exist there either:
   
   ```
         } recoverWith {
           case t =>
             logging.warn(this, s"[${msg.activationId}] activation has been dropped (${t.getMessage})")
             completeErrorActivation(msg, "The activation has not been processed: failed to get the queue endpoint.")
         }}
   ```
       
   All requests for this action will then be dropped in the QueueManager unless 
the schedulers are restarted. Is there any way to make this more resilient so 
that if something gets stuck in this edge case, we can recover somehow without 
requiring a restart? @style95 
   
   
   Additional logs for the timeline:
   
   I know connectivity to etcd failed because this log is emitted by the 
controller for two seconds, after which all activations for that action begin 
to fail while all other actions become fine again.
   
   ```
   
   [WARN] [#tid_T8Gk2BdDdf4PIjq8W8Ta12kD0ZAOMgwE] [ActionsApi] No scheduler 
endpoint available [marker:controller_loadbalancer_error:2:0]
   --
   ```
   
   Then from the invoker this log is emitted from the containers that exist for 
the action until the schedulers are restarted:
   
   ```
   
   [ActivationClientProxy] The queue of action [REDACTED] does not exist. Check 
for queues in other schedulers.
   --
   ```
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.