bdoyle0182 commented on code in PR #5320:
URL: https://github.com/apache/openwhisk/pull/5320#discussion_r959200114
##########
core/scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/container/ContainerManager.scala:
##########
@@ -144,6 +144,11 @@ class ContainerManager(jobManagerFactory: ActorRefFactory => ActorRef,
    logging.info(this, s"received ${msgs.size} creation message [${msgs.head.invocationNamespace}:${msgs.head.action}]")
ContainerManager
.getAvailableInvokers(etcdClient, memory, invocationNamespace)
+      .recover({
+        case t: Throwable =>
+          logging.error(this, s"Unable to get available invokers: ${t.getMessage}.")
+          List.empty[InvokerHealth]
+      })
Review Comment:
A request is made to etcd via `.getAvailableInvokers` to get the list of healthy invokers. Prior to this change there was no failure handling on that call, so if the future for the etcd request failed for any reason the failure was silently dropped. The enclosing function is just a synchronous `Unit`, so without failure handling on that future it simply completes and never sends an acknowledgement back to the memory queue that the creation message was processed, successfully or not, which the memory queue needs in order to decrement `creationIds`. To clarify: if it fails silently at this point, the creation job has not yet been registered, so the akka timer that would time out the creation job and call back to `MemoryQueue` on timeout is never created. The memory queue therefore thinks a container creation is in progress indefinitely. If the action never needs more than one container, it will never be able to execute, because the memory queue believes one is already in progress. The memory queue also cannot be stopped on timeout in this case, because `creationIds` is not 0 (a minimal sketch of this failure path follows the snippet below):
```
else {
  logging.info(
    this,
    s"[$invocationNamespace:$action:$stateName] The queue is timed out but there are still ${queue.size} activation messages or (running: ${containers.size}, in-progress: ${creationIds.size}) containers")
  stay
}
```
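To make the unhandled-future point concrete, here is a minimal, self-contained sketch. This is not OpenWhisk code: `fetchInvokers`, `ackFailure` and `schedule` are invented placeholders standing in for `getAvailableInvokers`, the negative ack back to the memory queue, and the actual scheduling step.
```
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object CreationSketch {
  // Stand-in for getAvailableInvokers: always fails here to exercise the failure path.
  def fetchInvokers()(implicit ec: ExecutionContext): Future[List[String]] =
    Future.failed(new RuntimeException("etcd unavailable"))

  // Stand-in for the negative ack that lets the queue decrement its in-progress count.
  def ackFailure(): Unit =
    println("negative ack: creation failed, decrement in-progress count")

  // Stand-in for the real scheduling step.
  def schedule(invokers: List[String]): Unit =
    println(s"scheduling creation onto ${invokers.size} invokers")

  // The handler returns Unit. Without the recover, a failed future is simply
  // dropped: the foreach never runs, so no ack, no registered creation job,
  // and no timeout timer.
  def handleCreation()(implicit ec: ExecutionContext): Unit =
    fetchInvokers()
      .recover {
        case NonFatal(t) =>
          // Recovering onto the success path keeps the downstream step alive
          // so the failure can still be reported back to the queue.
          println(s"unable to get available invokers: ${t.getMessage}")
          List.empty[String]
      }
      .foreach { invokers =>
        if (invokers.isEmpty) ackFailure() else schedule(invokers)
      }
}
```
Calling `handleCreation()` with the failing `fetchInvokers` above still reaches the negative ack; deleting the `recover` leaves the future's failure unobserved and the queue waiting forever.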
So while I think this covers the only edge case I know of, we really need an additional safeguard in `MemoryQueue` that eventually clears out its knowledge of in-progress containers if things get out of sync. There is no way to guarantee `creationIds` stays perfectly in sync when it essentially depends on a fire-and-forget call successfully making the callback at some point, which is prone to introducing bugs even if this PR covers every case for now. A rough sketch of that kind of safeguard follows.
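For illustration only, the safeguard could look like a periodic sweep (assuming Akka 2.6 timers). `CreationTracker`, `creationStartTimes` and `StaleCreationSweep` are invented names, and the real `MemoryQueue` is an FSM that tracks `creationIds` differently, so this is not a drop-in change:
```
import java.time.Instant
import scala.concurrent.duration._
import akka.actor.{Actor, Timers}

case object StaleCreationSweep
final case class CreationStarted(id: String)
final case class CreationFinished(id: String)

class CreationTracker(maxAge: FiniteDuration = 2.minutes) extends Actor with Timers {
  // When each in-progress creation was first seen.
  private var creationStartTimes = Map.empty[String, Instant]

  timers.startTimerWithFixedDelay("sweep", StaleCreationSweep, 30.seconds)

  override def receive: Receive = {
    case CreationStarted(id)  => creationStartTimes += id -> Instant.now()
    case CreationFinished(id) => creationStartTimes -= id
    case StaleCreationSweep =>
      val cutoff = Instant.now().minusMillis(maxAge.toMillis)
      val (stale, live) = creationStartTimes.partition { case (_, started) => started.isBefore(cutoff) }
      if (stale.nonEmpty) {
        // Forget creations we never heard back about, so the queue is not
        // blocked forever waiting on a callback that will never arrive.
        creationStartTimes = live
      }
  }
}
```
On each sweep it simply drops entries older than `maxAge`, so even a lost callback eventually unblocks things; the equivalent in `MemoryQueue` would be to prune `creationIds` (or stop the queue) once an entry has been in progress implausibly long.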