Heya,

thanks for the elaborate proposal.

Do you have any more information on why these containers are dying off in
the first place? In the case of Kubernetes/Mesos I could imagine we might
want to keep the Invoker's state consistent by periodically reconciling it
against the respective API. On Kubernetes, for instance, you could set up an
informer that notifies you of any state changes on the pods that this
Invoker has spawned. If a prewarm container dies this way, we can simply
remove it from the Invoker's bookkeeping and trigger a backfill.
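
Roughly, something like the sketch below (untested, using the fabric8
client's plain watch, which should be close enough to an informer for this
purpose; the "invoker" label, removeFromBookkeeping and backfillPrewarms are
made-up placeholders for whatever the ContainerPool actually tracks):

    import io.fabric8.kubernetes.api.model.Pod
    import io.fabric8.kubernetes.client.Watcher.Action
    import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException, Watcher}

    val invokerId = "invoker0" // assumption: pods are labeled with the invoker that owns them

    // Hypothetical hooks into the Invoker's bookkeeping.
    def removeFromBookkeeping(podName: String): Unit = ???
    def backfillPrewarms(): Unit = ???

    val kube = new DefaultKubernetesClient()

    kube.pods()
      .inNamespace("openwhisk")
      .withLabel("invoker", invokerId)
      .watch(new Watcher[Pod] {
        override def eventReceived(action: Action, pod: Pod): Unit = action match {
          case Action.DELETED | Action.ERROR =>
            // The pod is gone: forget the container and refill prewarms if needed.
            removeFromBookkeeping(pod.getMetadata.getName)
            backfillPrewarms()
          case _ => // ADDED/MODIFIED are not interesting here
        }
        override def onClose(cause: KubernetesClientException): Unit = {
          // Re-establish the watch if it closes unexpectedly.
        }
      })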

Secondly, could we potentially fold this check into the HTTP requests
themselves? If we get a "connection refused" on an action that we knew
worked before, we can safely retry. There is a set of exceptions that our
HTTP clients should surface and that are safe for us to retry in the
invoker anyway. The only addition you'd need in this case, I believe, is an
enhancement to the ContainerProxy's state machine that allows for such a
retrying use-case. The "connection refused" case I mentioned should be
equivalent to the TCP probe you're doing now.
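
For illustration, the kind of classification I have in mind (the concrete
exception types depend on which HTTP client we end up using, and tcpPing is
just a raw socket connect, not an existing helper):

    import java.net.{ConnectException, InetSocketAddress, NoRouteToHostException, Socket}
    import scala.util.Try

    // Failures where no request can have reached the container, so retrying
    // (or rescheduling to another container) is safe.
    def isRetriable(t: Throwable): Boolean = t match {
      case _: ConnectException       => true // "connection refused"
      case _: NoRouteToHostException => true
      case _                         => false
    }

    // Roughly what the TCP probe boils down to: can we open a connection at all?
    def tcpPing(host: String, port: Int, timeoutMs: Int = 100): Boolean =
      Try {
        val socket = new Socket()
        try socket.connect(new InetSocketAddress(host, port), timeoutMs)
        finally socket.close()
      }.isSuccess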

WDYT?

Cheers,
Markus

On Sun., Oct. 27, 2019 at 02:56, Tyson Norris
<tnor...@adobe.com.invalid> wrote:

> Hi Whiskers –
> We periodically have an unfortunate problem where a docker container (or
> worse, many of them) dies off unexpectedly, outside of HTTP usage from
> invoker. In these cases, prewarm or warm containers may still have
> references at the Invoker, and eventually if an activation arrives that
> matches those container references, the HTTP workflow starts and fails
> immediately since the node is not listening anymore, resulting in failed
> activations. Or, an even worse situation can occur when a container failed
> earlier, and a new container, initialized with a different action, is
> started on the same host and port (more likely a problem for k8s/mesos
> cluster usage).
>
> To mitigate these issues, I put together a health check process [1] from
> invoker to action containers, where we can test
>
>   *   prewarm containers periodically to verify they are still
> operational, and
>   *   warm containers immediately after resuming them (before HTTP
> requests are sent)
> In case of prewarm failure, we should backfill the prewarms to the
> specified config count.
> In case of warm failure, the activation is rescheduled to ContainerPool,
> which typically would either route to a different prewarm, or start a new
> cold container.
>
> The test ping is in the form of a TCP connection only, since we would
> otherwise need to update the HTTP protocol implemented by all runtimes.
> This test is good enough for the worst case of “container has gone
> missing”, but cannot test for more subtle problems like “/run endpoint is
> broken”. There could be other checks we add in the future to increase the
> quality of the test, but most of this I think requires expanding the HTTP
> protocol and state managed at the container, and I wanted to get something
> working for basic functionality to start with.
>
> Let me know if you have opinions about this, and we can discuss here or
> in the PR.
> Thanks
> Tyson
>
> [1] https://github.com/apache/openwhisk/pull/4698
>
