bdoyle0182 commented on issue #5325:
URL: https://github.com/apache/openwhisk/issues/5325#issuecomment-1247415358

   Figured it out.
   
   When the container is unpaused by the `FunctionPullingContainerProxy`, the FSM transitions to `Running` while the asynchronous unpause is still running. If a failure message arrives during that window, it triggers container cleanup while the unpause is still in flight.

   Here is the exact order of events:

   1. Unpause starts and the FSM immediately goes to `Running`; the `.map` callback runs later (see the timing sketch after the snippet below).
   
   ```
         implicit val transId = job.transId
         val parent = context.parent
         cancelTimer(IdleTimeoutName)
         cancelTimer(KeepingTimeoutName)
         data.container
           .resume()
           .map { _ =>
             logging.info(this, s"Resumed container ${data.container.containerId}")
             // put existing key again
             dataManagementService ! RegisterData(
               s"${ContainerKeys.existingContainers(data.invocationNamespace, data.action.fullyQualifiedName(true), data.action.rev, Some(instance), Some(data.container.containerId))}",
               "")
             parent ! Resumed(data)
             // the new queue may locates on an different scheduler, so recreate the activation client when necessary
             // since akka port will no be used, we can put any value except 0 here
             data.clientProxy ! RequestActivation(
               newScheduler = Some(SchedulerEndpoints(job.schedulerHost, job.rpcPort, 10)))
             startSingleTimer(UnusedTimeoutName, StateTimeout, unusedTimeout)
             timedOut = false
           }
           .recover {
             case t: Throwable =>
               logging.error(this, s"Failed to resume container ${data.container.containerId}, error: $t")
               parent ! ResumeFailed(data)
               self ! ResumeFailed(data)
           }

         // always clean data in etcd regardless of success and failure
         dataManagementService ! UnregisterData(
           ContainerKeys.warmedContainers(
             data.invocationNamespace,
             data.action.fullyQualifiedName(false),
             data.revision,
             instance,
             data.container.containerId))
         goto(Running)
   ```
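
   Note that `goto(Running)` returns synchronously; everything inside the `.map`/`.recover` callbacks runs later on the dispatcher, after the FSM is already in `Running`. A tiny standalone sketch of that timing (plain Scala, nothing OpenWhisk-specific, names are just placeholders):

   ```scala
   import scala.concurrent.{Await, Future}
   import scala.concurrent.ExecutionContext.Implicits.global
   import scala.concurrent.duration._

   object TimingSketch extends App {
     // stands in for data.container.resume(): completes some time later
     val resume = Future { Thread.sleep(100); "resumed" }

     // stands in for the .map body that re-registers the etcd key
     val callback = resume.map(r => println(s"callback runs later: $r"))

     // stands in for goto(Running): this line runs first, before the callback
     println("state is already Running here")

     Await.ready(callback, 2.seconds)
   }
   ```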
   
   2. Some asynchronous code triggered by the FSM to talk to etcd fails, and the failure is piped back as a message to the FSM, which is now in the `Running` state when it processes it (see the pipe sketch after the handler below).
   
   ```
       case Event(t: FailureMessage, data: WarmData) =>
         logging.error(this, s"failed to init or run in state: $stateName caused by: $t")
         // Stop containerProxy and ActivationClientProxy both immediately,
         // and don't send unhealthy state message to the health manager, it's already sent.
         cleanUp(
           data.container,
           data.invocationNamespace,
           data.action.fullyQualifiedName(withVersion = true),
           data.action.rev,
           Some(data.clientProxy))
   ```
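
   For context on how that failure lands in the new state: the failed Future is delivered to the FSM as a message via akka's pipe pattern (as far as I can tell, `FailureMessage` here is just an alias for akka's `Status.Failure`), and by the time it is delivered the FSM has already moved on to `Running`. A minimal, self-contained sketch of that pattern (hypothetical actor and message, not proxy code):

   ```scala
   import akka.actor.{Actor, ActorSystem, Props, Status}
   import akka.pattern.pipe
   import scala.concurrent.Future

   // A failed Future piped back to an actor arrives as akka.actor.Status.Failure,
   // no matter which state the receiving FSM has transitioned to in the meantime.
   class EtcdCaller extends Actor {
     import context.dispatcher

     // stands in for the call that was started while the proxy was still Paused
     Future.failed(new RuntimeException("HTTP/2 error code: PROTOCOL_ERROR")).pipeTo(self)

     def receive: Receive = {
       case Status.Failure(t) =>
         // the proxy matches this as `case Event(t: FailureMessage, ...)`
         println(s"failure delivered after the state change: $t")
         context.system.terminate()
     }
   }

   object PipeSketch extends App {
     val system = ActorSystem("pipe-sketch")
     system.actorOf(Props[EtcdCaller](), "etcd-caller")
   }
   ```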
   
   This is evidenced by an additional log line I found, which lines up exactly with the moment the container removal occurs in the initially shared logs:
   
   ```
   [2022-09-14T08:33:34.837Z] [ERROR] [#tid_sid_unknown] [FunctionPullingContainerProxy] failed to init or run in state: Running caused by: Failure(io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: PROTOCOL_ERRORReceived Goaway[p]req HEADERS: max concurrency reached)
   ```

   3. The original unpause and the cleanup are now happening at the same time. Since the state already transitioned to `Running`, the cleanup skips the unpause step and goes straight to `destroyContainer`. Meanwhile the original unpause succeeds and its `.map` callback runs, rewriting the etcd keys.
   
   ```
     private def cleanUp(container: Container, clientProxy: Option[ActorRef], replacePrewarm: Boolean = true): State = {

       context.parent ! ContainerRemoved(replacePrewarm)
       val unpause = stateName match {
         case Paused => container.resume()(TransactionId.invokerNanny)
         case _      => Future.successful(())
       }
       unpause.andThen {
         case Success(_) => destroyContainer(container)
         case Failure(t) =>
           // docker may hang when try to remove a paused container, so we shouldn't remove it
           logging.error(this, s"Failed to resume container ${container.containerId}, error: $t")
       }
       clientProxy match {
         case Some(clientProxy) => clientProxy ! StopClientProxy
         case None              => self ! ClientClosed
       }
       gotoIfNotThere(Removing)
     }
   ```
   
   4. The unpause operation then completes successfully and writes the existing-container key back to etcd after the container has been destroyed, or while it is in the process of being destroyed.
   
   So what's the best way to add extra protection here? I think the easiest fix would be to stay in the `Paused` state until the unpause operation completes, and then, still within the `Paused` state, handle a success or failure message that decides what to do next (on success, go to `Running`). A rough sketch of that idea follows.
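
   To make that concrete, here is a rough, self-contained sketch of the shape I mean (plain akka classic FSM with made-up stand-in states and messages, not actual proxy code): the FSM stays in `Paused`, pipes the resume result back to itself, and only transitions to `Running` once the success message arrives, so a failure that races in is still handled by the `Paused` handlers, which know the container may still be paused.

   ```scala
   import akka.actor.{ActorSystem, FSM, Props, Status}
   import akka.pattern.pipe
   import scala.concurrent.Future

   // hypothetical stand-ins, not OpenWhisk types
   sealed trait S
   case object Paused extends S
   case object Running extends S
   case object Removing extends S

   case object Unpause
   case object ResumeSucceeded

   class ProxySketch extends FSM[S, Unit] {
     import context.dispatcher

     startWith(Paused, ())

     when(Paused) {
       case Event(Unpause, _) =>
         // stands in for data.container.resume(); the result is piped back
         // instead of transitioning to Running right away
         Future { /* docker unpause */ ResumeSucceeded }.pipeTo(self)
         stay()

       case Event(ResumeSucceeded, _) =>
         // only now re-register the etcd key, request an activation, etc.
         log.info("resume finished, moving to Running")
         goto(Running)

       case Event(Status.Failure(t), _) =>
         // a racing failure is still handled by the Paused state, so cleanup
         // knows the container may still be paused and must be resumed first
         log.error(t, "resume failed while Paused, cleaning up")
         goto(Removing)
     }

     when(Running) { case Event(_, _)  => stay() }
     when(Removing) { case Event(_, _) => stay() }

     initialize()
   }

   object ProxySketchMain extends App {
     val system = ActorSystem("proxy-sketch")
     system.actorOf(Props[ProxySketch](), "proxy") ! Unpause
   }
   ```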

