bdoyle0182 opened a new issue, #5325:
URL: https://github.com/apache/openwhisk/issues/5325
This is consistently reproducible for me now. When a container is unpaused, its
keys are first unwatched and possibly deleted before the unpause operation
completes. At the same time as the unpause operation, a container remove can
happen (I'm still trying to figure out what triggers this, but my theory is
that unwatching the container keys causes whatever is watching them to request
deletion of the container). When the unpause operation runs, the future that
requested the unpause always completes successfully, regardless of the result
of the underlying docker command. This causes the etcd key for the container to
be re-written as part of the `.map` after the unpause future completes. Since
the container was deleted, this key is no longer tracked and will live
indefinitely unless the component is restarted. This is a critical bug for me
moving forward with the scheduler since it's so easily reproducible.
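To make the failure mode concrete, here is a minimal, self-contained sketch (stand-in names only, not the actual proxy code) of how a resume future that swallows the docker error lets the `.map` callback write the key back for a container that has already been removed:
```
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SwallowedFailureSketch extends App {
  // stand-in for container.resume(): the underlying `docker unpause` fails,
  // but the error is swallowed and the Future still completes successfully
  def resume(): Future[Unit] = Future {
    val unpauseFailed = true // pretend the docker command returned an error
    if (unpauseFailed) println("unpause failed, but the failure is not propagated")
  }

  // stand-in for the existingContainers key in etcd
  @volatile var existingContainersKey: Option[String] = None

  val done = resume().map { _ =>
    // always reached because resume() never fails its Future, so the key is
    // written back even though `docker rm` may already have removed the container
    existingContainersKey = Some("whisk/namespace/<redacted>/existingContainers")
  }
  Await.ready(done, 5.seconds)
  println(s"key after resume: $existingContainersKey") // Some(...), i.e. an orphaned key
}
```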
Relevant logs (a couple of which I added locally to better follow the exact code path):
```
[2022-09-14T00:00:12.526Z] [INFO] [#tid_sid_unknown] [ContainerManager] Choose a warmed container whisk/warmed
[2022-09-14T00:00:12.580Z] [INFO] [#tid_sid_unknown] [FunctionPullingContainerPool] Choose warmed container
[2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_containerCreation] [DockerClientWithFileAccess] running /usr/bin/docker unpause ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d (timeout: 10 seconds)
#these two unwatch logs are from above where I said the unregister data can happen before the register data when the unpause completes
[2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(whisk/warmed
[2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(whisk/namespace
[2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_invokerNanny] [DockerClientWithFileAccess] running /usr/bin/docker rm -f ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d (timeout: 1 minute)
#added this log locally indicating the etcd key is removed
[2022-09-14T00:00:12.585Z] [INFO] [#tid_sid_unknown] [EtcdWorker] successfully removed whisk/warmed
[2022-09-14T00:00:12.588Z] [INFO] [#tid_sid_unknown] [EtcdWorker] successfully removed whisk/namespace
[2022-09-14T00:00:12.595Z] [INFO] [#tid_DgxZq5kjsHIwixkSsI5onK8zYSgpOkxM] [MemoryQueue] Remove containerId because ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d is not alive
#unpause operation completes
[2022-09-14T00:00:12.705Z] [INFO] [#tid_sid_containerCreation] [FunctionPullingContainerProxy] Resumed container ContainerId(ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d)
[2022-09-14T00:00:12.706Z] [INFO] [#tid_sid_unknown] [WatcherService] watch endpoint: WatchEndpoint(whisk/namespace
#key is added back to etcd for the deleted container
[2022-09-14T00:00:12.711Z] [INFO] [#tid_sid_unknown] [EtcdWorker] successfully stored whisk/namespace
#added this log locally to denote when the memory queue sees an existing container and adds it to its existing set
[2022-09-14T00:00:12.711Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [redacted:Running] start tracking container ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d
[2022-09-14T00:11:09.557Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [redacted:Running] The queue is timed out but there are still 0 activation messages or (running: 1 -> Set(ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d), in-progress: 0 -> Set()) containers
```
Relevant code:
```
when(Paused) {
  case Event(job: Initialize, data: WarmData) =>
    implicit val transId = job.transId
    val parent = context.parent
    cancelTimer(IdleTimeoutName)
    cancelTimer(KeepingTimeoutName)
    data.container
      .resume()
      .map { _ =>
        logging.info(this, s"Resumed container ${data.container.containerId}")
        // put existing key again
        dataManagementService ! RegisterData(
          s"${ContainerKeys.existingContainers(data.invocationNamespace, data.action.fullyQualifiedName(true), data.action.rev, Some(instance), Some(data.container.containerId))}",
          "")
        parent ! Resumed(data)
        // the new queue may be located on a different scheduler, so recreate the activation client when necessary
        // since the akka port will not be used, we can put any value except 0 here
        data.clientProxy ! RequestActivation(
          newScheduler = Some(SchedulerEndpoints(job.schedulerHost, job.rpcPort, 10)))
        startSingleTimer(UnusedTimeoutName, StateTimeout, unusedTimeout)
        timedOut = false
      }
      .recover {
        case t: Throwable =>
          logging.error(this, s"Failed to resume container ${data.container.containerId}, error: $t")
          parent ! ResumeFailed(data)
          self ! ResumeFailed(data)
      }
    // always clean data in etcd regardless of success and failure
    dataManagementService ! UnregisterData(
      ContainerKeys.warmedContainers(
        data.invocationNamespace,
        data.action.fullyQualifiedName(false),
        data.revision,
        instance,
        data.container.containerId))
    goto(Running)
```
You can see here that dataManagementService sends the request to unregister the
etcd keys outside the unpause future chain, so the unregister can happen before
the docker unpause occurs. The `.map` is always entered even if the docker
unpause fails (verifiable through the unpause-error metric emitted from
`LoggingMarkers.INVOKER_DOCKER_CMD`). The only thing left for me to figure out
is why `docker rm` occurs at the same time as the unpause operation. Maybe it's
as simple as the unpause failing for whatever reason and triggering the
removal, in which case the fix may be as simple as making sure the etcd write
doesn't occur if the unpause fails; but I'm not yet convinced that the `docker
rm` isn't caused by some other side effect.
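If that turns out to be the case, one possible shape of the fix is to chain both etcd writes to the resume future: only re-register the existing-container key when the unpause actually succeeded, and defer the warmed-key cleanup until the unpause has finished so it can't run ahead of the docker command. A rough, self-contained sketch of that ordering (hypothetical helper names, not a tested patch, and it assumes `resume()` is changed so that a failed `docker unpause` fails the returned Future):
```
import scala.concurrent.{ExecutionContext, Future}

object ResumeOrderingSketch {
  // hypothetical helper illustrating the ordering, not a patch against the proxy
  def resumeThenReconcileEtcd(unpauseContainer: () => Future[Unit],  // wraps `docker unpause`
                              reRegisterExistingKey: () => Unit,     // RegisterData(existingContainers ...)
                              unregisterWarmedKey: () => Unit)       // UnregisterData(warmedContainers ...)
                             (implicit ec: ExecutionContext): Future[Unit] =
    unpauseContainer()
      .map { _ =>
        // only runs when the unpause actually succeeded, so the key can no
        // longer be written back for a container that is being removed
        reRegisterExistingKey()
      }
      .andThen {
        // runs once the unpause outcome is known, success or failure, so the
        // warmed-key cleanup can no longer run before `docker unpause` finishes
        case _ => unregisterWarmedKey()
      }
}
```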