I don't think there's any specific reason we don't set a timeout; I'm guessing it was just never worth the effort of implementing. If it's stuck, it should be pretty obvious from the logs: "Still waiting for startup of environment from {} for worker id {}"
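For reference, a bounded wait would not be much code. Here is a minimal, hypothetical sketch (not Beam's actual implementation — the helper name and defaults are made up for illustration) of polling an endpoint until it accepts TCP connections or a deadline passes, instead of blocking forever:

```python
import socket
import time

def wait_for_endpoint(host, port, timeout=60.0, interval=0.5):
    """Block until (host, port) accepts TCP connections, or raise
    TimeoutError once `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Success: the service is accepting connections.
            with socket.create_connection((host, port), timeout=interval):
                return
        except OSError:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{host}:{port} not ready after {timeout:.0f}s")
            time.sleep(interval)
```

The point is just that the unbounded block and the timed wait differ by a deadline check; the log line above would then become an exception after the timeout instead of repeating indefinitely.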
On Thu, May 27, 2021 at 4:04 PM Ke Wu <ke.wu...@gmail.com> wrote:

> Hi Kyle,
>
> Thank you for the prompt response, and apologies for the late reply.
>
> [1] seems to be available only in the Python portable_runner but not the
> Java PortableRunner. Is that intended, or could we add similar changes in
> Java as well?
>
> [2] makes sense to block, since the wait/retry is handled in the preceding
> prepare(). However, is there any specific reason why we do not want to
> support a timeout in the start worker request?
>
> Best,
> Ke
>
> On May 14, 2021, at 11:25 AM, Kyle Weaver <kcwea...@google.com> wrote:
>
> 1. and 2. are both facilitated by gRPC, which takes care of most of the
> retry/wait logic. In some places we have a configurable timeout (which
> defaults to 60s) [1], while in other places we block [2][3].
>
> [1] https://issues.apache.org/jira/browse/BEAM-7933
> [2] https://github.com/apache/beam/blob/51541a595b09751dd3dde2c50caf2a968ac01b68/sdks/python/apache_beam/runners/portability/portable_runner.py#L238-L242
> [3] https://github.com/apache/beam/blob/9601bdef8870bc6acc7895c06252e43ec040bd8c/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ExternalEnvironmentFactory.java#L115
>
> On Fri, May 14, 2021 at 10:51 AM Ke Wu <ke.wu...@gmail.com> wrote:
>
>> Hello All,
>>
>> I came across this question while reading Beam on Flink on Kubernetes
>> <https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.x9qy4wlfgc1g>
>> and flink-on-k8s-operator
>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/tree/0310df76d6e2128cd5d2bc51fae4e842d370c463>,
>> and realized that there seems to be no retry/wait logic built into
>> PortableRunner or ExternalEnvironmentFactory (correct me if I am wrong),
>> which implies that:
>>
>> 1. The Job Server needs to be ready to accept requests before the SDK
>> client can submit a request.
>> 2. The External Worker Pool Service needs to be ready to accept
>> start/stop worker requests before the runner starts making requests.
>>
>> This may bring some challenges on k8s, since Flink opts to use the
>> multi-container pattern when bringing up a Beam portable pipeline. In
>> addition, I don't find any special lifecycle management in place to
>> guarantee the order, e.g. that the External Worker Pool Service container
>> starts and is ready before the task manager container starts making
>> requests.
>>
>> I am wondering if I missed anything that guarantees the readiness of the
>> dependent service, or whether we are relying on the dependent containers
>> being much lighter weight, so that they should in most cases be ready
>> before the other container starts making requests.
>>
>> Best,
>> Ke
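One common way to decouple container startup order — sketched here as a generic illustration, not something Beam or the Flink operator does today, and with all names hypothetical — is for the requesting side to retry with capped exponential backoff. The requester then tolerates the Job Server or Worker Pool container coming up later, regardless of which container Kubernetes starts first:

```python
import random
import time

def call_with_retry(fn, max_elapsed=60.0, base=0.5, cap=5.0):
    """Call fn() until it succeeds or `max_elapsed` seconds pass,
    sleeping with capped exponential backoff plus jitter between tries.
    Re-raises the last ConnectionError once the deadline is exceeded."""
    deadline = time.monotonic() + max_elapsed
    attempt = 0
    while True:
        try:
            return fn()
        except ConnectionError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise
            # Backoff grows 2x per attempt, capped, with jitter to avoid
            # thundering-herd retries from many workers at once.
            delay = min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
            time.sleep(min(delay, remaining))
            attempt += 1
```

With something like this wrapped around the first start-worker request, the multi-container pod would not need explicit lifecycle ordering: the retry loop absorbs the gap until the dependent container is ready.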