> > Therefore, if we bring up the external worker pool container together with
> > the runner container, which is one of the supported approaches by the Flink
> > Runner on K8s
Exactly which approach are you talking about in the doc? I feel like there could be some misunderstanding here. Here is the configuration I'm talking about:

https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/without_job_server/beam_flink_cluster.yaml

Basically this config describes a Flink task manager with a Beam worker pool sidecar. The user invokes it with:

kubectl apply -f examples/beam/without_job_server/beam_flink_cluster.yaml

It doesn't matter which container is started first, the task manager container or the worker pool sidecar, because no communication between the two containers is necessary at this time. The instructions are to start the cluster first and wait until it is ready before submitting a job, e.g.:

kubectl apply -f examples/beam/without_job_server/beam_wordcount_py.yaml

The task manager only sends the worker pool requests once it's running a job. So for things to go wrong in the way you describe:

1. The user submits a job, then starts a Flink cluster -- reversing the order of steps in the instructions.
2. The worker pool sidecar takes way longer to start up than the task manager container for some reason.
3. The Flink cluster accepts and starts running the job before the worker pool sidecar is ready -- I'm not familiar enough with k8s lifecycle management or the Flink operator implementation to be sure this is even possible.

I've never seen this happen. But of course there are plenty of unforeseen ways things can go wrong, so I'm not opposed to improving our error handling here more generally.
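For reference, here is a rough sketch of the kind of manifest that example describes -- a FlinkCluster custom resource whose task manager pods carry a Beam worker pool sidecar. This is a hand-written illustration, not a copy of the linked file: the image tags, replica count, and resource limits are placeholders, and it assumes the operator's taskManager.sidecars field for attaching extra containers.

```yaml
# Sketch of a FlinkCluster with a Beam worker pool sidecar on each task manager pod.
# Field names follow the flink-on-k8s-operator CRD; tags and sizes are illustrative.
apiVersion: flinkoperator.k8s.io/v1beta1
kind: FlinkCluster
metadata:
  name: beam-flink-cluster
spec:
  image:
    name: flink:1.10.1                 # Flink version is illustrative
  jobManager:
    resources:
      limits:
        memory: "1Gi"
  taskManager:
    replicas: 2
    resources:
      limits:
        memory: "2Gi"
    sidecars:
      - name: beam-worker-pool          # Beam SDK harness pool, one per task manager pod
        image: apache/beam_python3.7_sdk:2.25.0   # SDK image/tag is illustrative
        args: ["--worker_pool"]         # runs the container as an external worker pool service
        ports:
          - containerPort: 50000        # default worker pool port
            name: pool
```

Because the sidecar shares the pod network with the task manager, a job submitted with the external environment (typically --environment_type=EXTERNAL --environment_config=localhost:50000) reaches the pool over localhost, and that connection is only opened once the task manager is actually running a job -- which is why the start order of the two containers doesn't matter before submission.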
