bdoyle0182 commented on issue #5325:
URL: https://github.com/apache/openwhisk/issues/5325#issuecomment-1247415358
Figured it out.
When the container is unpaused by the `FunctionPullingContainerProxy`, the proxy transitions to `Running` while the async unpause code is still running, so a failure message received in the meantime triggers container cleanup while the unpause is still in progress.
Here is the exact order of events:
1. The unpause starts and the FSM immediately goes to `Running` (see the sketch after the snippet below).
```
implicit val transId = job.transId
val parent = context.parent
cancelTimer(IdleTimeoutName)
cancelTimer(KeepingTimeoutName)
data.container
  .resume()
  .map { _ =>
    logging.info(this, s"Resumed container ${data.container.containerId}")
    // put existing key again
    dataManagementService ! RegisterData(
      s"${ContainerKeys.existingContainers(
        data.invocationNamespace,
        data.action.fullyQualifiedName(true),
        data.action.rev,
        Some(instance),
        Some(data.container.containerId))}",
      "")
    parent ! Resumed(data)
    // the new queue may be located on a different scheduler, so recreate the activation client when necessary
    // since the akka port will not be used, we can put any value except 0 here
    data.clientProxy ! RequestActivation(
      newScheduler = Some(SchedulerEndpoints(job.schedulerHost, job.rpcPort, 10)))
    startSingleTimer(UnusedTimeoutName, StateTimeout, unusedTimeout)
    timedOut = false
  }
  .recover {
    case t: Throwable =>
      logging.error(this, s"Failed to resume container ${data.container.containerId}, error: $t")
      parent ! ResumeFailed(data)
      self ! ResumeFailed(data)
  }
// always clean the data in etcd regardless of success or failure
dataManagementService ! UnregisterData(
  ContainerKeys.warmedContainers(
    data.invocationNamespace,
    data.action.fullyQualifiedName(false),
    data.revision,
    instance,
    data.container.containerId))
goto(Running)
```
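To make the ordering hazard concrete, here is a minimal, self-contained sketch (assumed names, not OpenWhisk code): the state transition returns immediately, while the resume `Future`'s `.map` callback runs later, so the key re-registration can land after anything the FSM does in `Running`.
```scala
import scala.concurrent.{ExecutionContext, Future}

object UnpauseOrderingDemo extends App {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Stand-in for data.container.resume(): completes asynchronously, some time later.
  def resume(): Future[Unit] = Future { Thread.sleep(100) }

  def handleUnpause(): String = {
    resume().map { _ =>
      // Fires only once the resume finishes, i.e. potentially after a
      // FailureMessage has already triggered cleanUp in the Running state.
      println("late callback: re-register the existingContainers key in etcd")
    }
    "Running" // the goto(Running) equivalent: returned before the callback above runs
  }

  println(s"state is now: ${handleUnpause()}")
  Thread.sleep(300) // keep the JVM alive long enough to observe the late callback
}
```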
2. Some `Future` the FSM triggered to talk to etcd fails and pipes its failure message back to the FSM, which is by now in the `Running` state (a minimal illustration of the pipe pattern follows the snippet below).
```
case Event(t: FailureMessage, data: WarmData) =>
  logging.error(this, s"failed to init or run in state: $stateName caused by: $t")
  // Stop containerProxy and ActivationClientProxy both immediately,
  // and don't send an unhealthy state message to the health manager, it's already sent.
  cleanUp(
    data.container,
    data.invocationNamespace,
    data.action.fullyQualifiedName(withVersion = true),
    data.action.rev,
    Some(data.clientProxy))
```
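For context, this is the standard Akka pipe pattern: a failed `Future` piped to `self` is delivered as a `Status.Failure`, regardless of which state the FSM has moved to in the meantime. The sketch below is illustrative only (not OpenWhisk code) and assumes `FailureMessage` corresponds to that piped failure message.
```scala
import akka.actor.{Actor, ActorSystem, Props, Status}
import akka.pattern.pipe
import scala.concurrent.Future

class FsmLikeActor extends Actor {
  import context.dispatcher

  // Stand-in for an etcd/gRPC call made by the proxy; its outcome is piped back to self.
  Future
    .failed[String](new RuntimeException("INTERNAL: HTTP/2 error ... max concurrency reached"))
    .pipeTo(self)

  def receive: Receive = {
    // Delivered whenever the Future fails, even if the actor has since
    // moved on to a different logical state.
    case Status.Failure(t) => println(s"failed to init or run, caused by: $t")
    case other             => println(s"got: $other")
  }
}

object PipeDemo extends App {
  val system = ActorSystem("pipe-demo")
  system.actorOf(Props[FsmLikeActor](), "fsm-like")
}
```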
This is evidenced by an additional log line I found, which lines up to the exact microsecond with the container removal in the originally shared logs.
```
[2022-09-14T08:33:34.837Z] [ERROR] [#tid_sid_unknown] [FunctionPullingContainerProxy] failed to init or run in state: Running caused by: Failure(io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: PROTOCOL_ERRORReceived Goaway[p]req HEADERS: max concurrency reached)
```
3. The original unpause and the cleanup now run concurrently. Since the state has transitioned to `Running`, `cleanUp` skips the unpause step and goes straight to `destroyContainer`. Meanwhile the original unpause succeeds, and its subsequent `.map` rewrites the etcd keys.
```
private def cleanUp(container: Container, clientProxy: Option[ActorRef], replacePrewarm: Boolean = true): State = {
  context.parent ! ContainerRemoved(replacePrewarm)
  val unpause = stateName match {
    case Paused => container.resume()(TransactionId.invokerNanny)
    case _      => Future.successful(())
  }
  unpause.andThen {
    case Success(_) => destroyContainer(container)
    case Failure(t) =>
      // docker may hang when trying to remove a paused container, so we shouldn't remove it
      logging.error(this, s"Failed to resume container ${container.containerId}, error: $t")
  }
  clientProxy match {
    case Some(clientProxy) => clientProxy ! StopClientProxy
    case None              => self ! ClientClosed
  }
  gotoIfNotThere(Removing)
}
```
4. The unpause operation completes successfully and rewrites the key to etcd after the container has been destroyed, or while it is in the process of being destroyed.
So what's the best way to add extra protection here? I think the easiest fix would be to stay in the `Paused` state until the unpause operation completes, and then, on completion, send a message back to the `Paused` state handler indicating success or failure so the FSM can decide what to do next (on success, go to `Running`). A rough sketch of that shape is below.
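This is only a sketch, not a drop-in patch: the internal messages (`ContainerResumed` / `ContainerResumeFailed`) are hypothetical, `Initialize` stands in for whatever message actually triggers the unpause (the `job` carrying `schedulerHost`/`rpcPort` in the snippet above), timer handling and the warmed-key cleanup are omitted, and it assumes `akka.pattern.pipe` and an execution context are in scope.
```scala
// While Paused: kick off the resume, but stay in Paused until it completes.
case Event(job: Initialize, data: WarmData) =>
  implicit val transId = job.transId
  data.container
    .resume()
    .map(_ => ContainerResumed(job, data))                 // hypothetical message
    .recover { case t => ContainerResumeFailed(data, t) }  // hypothetical message
    .pipeTo(self)
  stay

// Resume succeeded: only now re-register the key and move to Running.
case Event(ContainerResumed(job, data), _: WarmData) =>
  dataManagementService ! RegisterData(
    s"${ContainerKeys.existingContainers(
      data.invocationNamespace,
      data.action.fullyQualifiedName(true),
      data.action.rev,
      Some(instance),
      Some(data.container.containerId))}",
    "")
  context.parent ! Resumed(data)
  data.clientProxy ! RequestActivation(
    newScheduler = Some(SchedulerEndpoints(job.schedulerHost, job.rpcPort, 10)))
  goto(Running)

// Resume failed: clean up while still in Paused, so cleanUp takes the paused path.
case Event(ContainerResumeFailed(data, t), _: WarmData) =>
  logging.error(this, s"Failed to resume container ${data.container.containerId}, error: $t")
  cleanUp(
    data.container,
    data.invocationNamespace,
    data.action.fullyQualifiedName(withVersion = true),
    data.action.rev,
    Some(data.clientProxy))
```
The point is that the etcd key is only re-registered from a handler that knows the resume outcome, so a `FailureMessage` arriving mid-unpause is handled by the `Paused` state's failure path instead of racing the `Running`-state cleanup.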