bdoyle0182 commented on code in PR #5320:
URL: https://github.com/apache/openwhisk/pull/5320#discussion_r959200114
##########
core/scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/container/ContainerManager.scala:
##########
@@ -144,6 +144,11 @@ class ContainerManager(jobManagerFactory: ActorRefFactory => ActorRef,
    logging.info(this, s"received ${msgs.size} creation message [${msgs.head.invocationNamespace}:${msgs.head.action}]")
ContainerManager
.getAvailableInvokers(etcdClient, memory, invocationNamespace)
+      .recover({
+        case t: Throwable =>
+          logging.error(this, s"Unable to get available invokers: ${t.getMessage}.")
+          List.empty[InvokerHealth]
+      })
Review Comment:
A request is made to etcd via `.getAvailableInvokers` to get the list of healthy invokers. Prior to this change there was no failure handling on that call, so if the future for the etcd request failed for any reason the failure was silently dropped. The enclosing function is just a synchronous `Unit`, so without failure handling on that future it simply completes and never sends an acknowledgement back to the memory queue that the creation message was processed, successfully or not, which the memory queue needs in order to decrement `creationIds`. To clarify: if it fails silently at this point, the creation job has not yet been registered, so the akka timer that would time out the creation job and call back to `MemoryQueue` on timeout is never created. The memory queue therefore thinks a container creation is in progress indefinitely. If the action never needs more than one container, it will never be able to execute, because the memory queue believes one is already in progress. The memory queue also cannot be stopped on timeout in this case, because `creationIds` is not 0 (a minimal sketch of this failure path follows the snippet below):
```
else {
  logging.info(
    this,
    s"[$invocationNamespace:$action:$stateName] The queue is timed out but there are still ${queue.size} activation messages or (running: ${containers.size}, in-progress: ${creationIds.size}) containers")
  stay
}
```
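To make the unhandled-future point concrete, here is a minimal, self-contained sketch. This is not OpenWhisk code: `fetchInvokers`, `ackFailure` and `schedule` are invented placeholders standing in for `getAvailableInvokers`, the negative ack back to the memory queue, and the actual scheduling step.
```
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object CreationSketch {
  // Stand-in for getAvailableInvokers: always fails here to exercise the failure path.
  def fetchInvokers()(implicit ec: ExecutionContext): Future[List[String]] =
    Future.failed(new RuntimeException("etcd unavailable"))

  // Stand-in for the negative ack that lets the queue decrement its in-progress count.
  def ackFailure(): Unit =
    println("negative ack: creation failed, decrement in-progress count")

  // Stand-in for the real scheduling step.
  def schedule(invokers: List[String]): Unit =
    println(s"scheduling creation onto ${invokers.size} invokers")

  // The handler returns Unit. Without the recover, a failed future is simply
  // dropped: the foreach never runs, so no ack, no registered creation job,
  // and no timeout timer.
  def handleCreation()(implicit ec: ExecutionContext): Unit =
    fetchInvokers()
      .recover {
        case NonFatal(t) =>
          // Recovering onto the success path keeps the downstream step alive
          // so the failure can still be reported back to the queue.
          println(s"unable to get available invokers: ${t.getMessage}")
          List.empty[String]
      }
      .foreach { invokers =>
        if (invokers.isEmpty) ackFailure() else schedule(invokers)
      }
}
```
Calling `handleCreation()` with the failing `fetchInvokers` above still reaches the negative ack; deleting the `recover` leaves the future's failure unobserved and the queue waiting forever.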
So while I think this covers the only edge case I know of, we really need an additional safeguard in `MemoryQueue` that eventually clears out its knowledge of in-progress containers if things get out of sync. There is no way to guarantee `creationIds` stays perfectly in sync when it essentially depends on a fire-and-forget call successfully making the callback at some point, which is prone to introducing bugs even if this PR covers every case for now. A rough sketch of that kind of safeguard follows.
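For illustration only, the safeguard could look like a periodic sweep (assuming Akka 2.6 timers). `CreationTracker`, `creationStartTimes` and `StaleCreationSweep` are invented names, and the real `MemoryQueue` is an FSM that tracks `creationIds` differently, so this is not a drop-in change:
```
import java.time.Instant
import scala.concurrent.duration._
import akka.actor.{Actor, Timers}

case object StaleCreationSweep
final case class CreationStarted(id: String)
final case class CreationFinished(id: String)

class CreationTracker(maxAge: FiniteDuration = 2.minutes) extends Actor with Timers {
  // When each in-progress creation was first seen.
  private var creationStartTimes = Map.empty[String, Instant]

  timers.startTimerWithFixedDelay("sweep", StaleCreationSweep, 30.seconds)

  override def receive: Receive = {
    case CreationStarted(id)  => creationStartTimes += id -> Instant.now()
    case CreationFinished(id) => creationStartTimes -= id
    case StaleCreationSweep =>
      val cutoff = Instant.now().minusMillis(maxAge.toMillis)
      val (stale, live) = creationStartTimes.partition { case (_, started) => started.isBefore(cutoff) }
      if (stale.nonEmpty) {
        // Forget creations we never heard back about, so the queue is not
        // blocked forever waiting on a callback that will never arrive.
        creationStartTimes = live
      }
  }
}
```
On each sweep it simply drops entries older than `maxAge`, so even a lost callback eventually unblocks things; the equivalent in `MemoryQueue` would be to prune `creationIds` (or stop the queue) once an entry has been in progress implausibly long.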