> > Therefore, if we bring up the external worker pool container together with
> > the runner container, which is one of the supported approaches by the Flink
> > Runner on K8s
Exactly which approach are you talking about in the doc? I feel like there could be some misunderstanding here. Here is the configuration I'm talking about:

https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/without_job_server/beam_flink_cluster.yaml

Basically this config describes a Flink task manager with a Beam worker pool sidecar. The user invokes it with:

kubectl apply -f examples/beam/without_job_server/beam_flink_cluster.yaml

It doesn't matter which container is started first, the task manager container or the worker pool sidecar, because no communication between the two containers is necessary at this time. The instructions are to start the cluster first and wait until it is ready before submitting a job, e.g.:

kubectl apply -f examples/beam/without_job_server/beam_wordcount_py.yaml

The task manager only sends the worker pool requests once it's running a job. So for things to go wrong in the way you describe:

1. The user submits a job, then starts a Flink cluster -- reversing the order of steps in the instructions.
2. The worker pool sidecar takes way longer to start up than the task manager container for some reason.
3. The Flink cluster accepts and starts running the job before the worker pool sidecar is ready -- I'm not familiar enough with k8s lifecycle management or the Flink operator implementation to be sure this is even possible.

I've never seen this happen. But of course there are plenty of unforeseen ways things can go wrong, so I'm not opposed to improving our error handling here more generally.
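For reference, here is a rough sketch of the kind of manifest that example describes -- a FlinkCluster custom resource whose task manager pods carry a Beam worker pool sidecar. This is a hand-written illustration, not a copy of the linked file: the image tags, replica count, and resource limits are placeholders, and it assumes the operator's taskManager.sidecars field for attaching extra containers.

```yaml
# Sketch of a FlinkCluster with a Beam worker pool sidecar on each task manager pod.
# Field names follow the flink-on-k8s-operator CRD; tags and sizes are illustrative.
apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: beam-flink-cluster
spec:
  image:
    name: flink:1.10.1                 # Flink version is illustrative
  jobManager:
    resources:
      limits:
        memory: "1Gi"
  taskManager:
    replicas: 2
    resources:
      limits:
        memory: "2Gi"
    sidecars:
      - name: beam-worker-pool          # Beam SDK harness pool, one per task manager pod
        image: apache/beam_python3.7_sdk:2.25.0   # SDK image/tag is illustrative
        args: ["--worker_pool"]         # runs the container as an external worker pool service
        ports:
          - containerPort: 50000        # default worker pool port
            name: pool
```

Because the sidecar shares the pod network with the task manager, a job submitted with the external environment (typically --environment_type=EXTERNAL --environment_config=localhost:50000) reaches the pool over localhost, and that connection is only opened once the task manager is actually running a job -- which is why the start order of the two containers doesn't matter before submission.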
