Good to know. We are working on running a Java portable pipeline on the Samza runner, and I believe we could take on the task of enhancing the Java workflow to support timeout/retry, etc., on gRPC calls.
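Roughly, the kind of enhancement this could look like is a per-call deadline on the gRPC stub plus a small retry wrapper around the call. The following is only a sketch; the stub and message names in the usage comment are illustrative, not the actual Beam classes.

import java.util.function.Supplier;

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

/** Sketch of a retry helper for gRPC calls; not actual Beam code. */
public final class GrpcRetry {

  /** Runs the call, retrying on UNAVAILABLE / DEADLINE_EXCEEDED with linear backoff. */
  public static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long backoffMillis)
      throws InterruptedException {
    StatusRuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.get();
      } catch (StatusRuntimeException e) {
        Status.Code code = e.getStatus().getCode();
        // Only retry when the service is likely not up or not reachable yet.
        if (code != Status.Code.UNAVAILABLE && code != Status.Code.DEADLINE_EXCEEDED) {
          throw e;
        }
        last = e;
        Thread.sleep(backoffMillis * attempt); // linear backoff between attempts
      }
    }
    throw last;
  }

  private GrpcRetry() {}
}

// Usage (names are illustrative, not the actual Beam classes): the deadline makes a
// single call fail fast instead of blocking forever, and the wrapper retries while
// the Job Server is still coming up.
//
//   PrepareJobResponse response =
//       GrpcRetry.callWithRetry(
//           () -> jobServiceStub
//               .withDeadlineAfter(60, java.util.concurrent.TimeUnit.SECONDS)
//               .prepare(prepareRequest),
//           /* maxAttempts= */ 5,
//           /* backoffMillis= */ 1000);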
Created BEAM-12419 <https://issues.apache.org/jira/browse/BEAM-12419> to track the work.

Best,
Ke

> On May 27, 2021, at 4:30 PM, Kyle Weaver <[email protected]> wrote:
>
> I don't think there's any specific reason we don't set a timeout; I'm guessing
> it was just never worth the effort of implementing. If it's stuck, it should be
> pretty obvious from the logs: "Still waiting for startup of environment from {}
> for worker id {}"
>
> On Thu, May 27, 2021 at 4:04 PM Ke Wu <[email protected]> wrote:
> Hi Kyle,
>
> Thank you for the prompt response, and apologies for the late reply.
>
> [1] seems to be only available in the Python portable_runner but not the Java
> PortableRunner. Is that intended, or could we add similar changes in Java as
> well?
>
> [2] makes sense to block, since the wait/retry is handled in the previous
> prepare(). However, is there any specific reason why we do not want to support
> a timeout in the start worker request?
>
> Best,
> Ke
>
>> On May 14, 2021, at 11:25 AM, Kyle Weaver <[email protected]> wrote:
>>
>> 1. and 2. are both facilitated by gRPC, which takes care of most of the
>> retry/wait logic. In some places we have a configurable timeout (which
>> defaults to 60s) [1], while in other places we block [2][3].
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-7933
>> [2] https://github.com/apache/beam/blob/51541a595b09751dd3dde2c50caf2a968ac01b68/sdks/python/apache_beam/runners/portability/portable_runner.py#L238-L242
>> [3] https://github.com/apache/beam/blob/9601bdef8870bc6acc7895c06252e43ec040bd8c/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ExternalEnvironmentFactory.java#L115
>>
>> On Fri, May 14, 2021 at 10:51 AM Ke Wu <[email protected]> wrote:
>> Hello All,
>>
>> I came across this question while reading Beam on Flink on Kubernetes
>> <https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.x9qy4wlfgc1g>
>> and flink-on-k8s-operator
>> <https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/tree/0310df76d6e2128cd5d2bc51fae4e842d370c463>
>> and realized that there seems to be no retry/wait logic built into the
>> PortableRunner or the ExternalEnvironmentFactory (correct me if I am wrong),
>> which implies that:
>>
>> 1. The Job Server needs to be ready to accept requests before the SDK client
>> can submit a request.
>> 2. The External Worker Pool Service needs to be ready to accept start/stop
>> worker requests before the runner starts making them.
>>
>> This may bring some challenges on Kubernetes, since Flink opts to use a
>> multi-container pattern when bringing up a Beam portable pipeline. In
>> addition, I don't find any special lifecycle management in place to guarantee
>> the ordering, e.g. that the External Worker Pool Service container starts and
>> is ready before the task manager container starts making requests.
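As a rough illustration of the kind of wait/retry logic in question, something along these lines could block until the dependent service is reachable before any start worker request is made. This is only a sketch, not existing Beam code; the class name and "localhost:50000" address are placeholders.

import java.util.concurrent.TimeUnit;

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

/** Sketch of a readiness gate for a dependent gRPC service; not existing Beam code. */
public final class AwaitChannelReady {

  /** Polls the channel until it reports READY or the timeout expires. */
  public static void awaitReady(ManagedChannel channel, long timeoutMillis)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    // getState(true) also asks the channel to start connecting if it has not yet.
    while (channel.getState(true) != ConnectivityState.READY) {
      if (System.currentTimeMillis() >= deadline) {
        throw new IllegalStateException(
            "Dependent service not reachable within " + timeoutMillis + " ms");
      }
      Thread.sleep(200); // simple polling; notifyWhenStateChanged() would avoid it
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // "localhost:50000" is a placeholder for the External Worker Pool Service address.
    ManagedChannel channel =
        ManagedChannelBuilder.forTarget("localhost:50000").usePlaintext().build();
    awaitReady(channel, TimeUnit.SECONDS.toMillis(60));
    channel.shutdownNow();
  }

  private AwaitChannelReady() {}
}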
>>
>> I am wondering if I missed anything that guarantees the readiness of the
>> dependent service, or whether we are relying on the dependent containers being
>> much lighter weight, so that they should, most of the time, be ready before the
>> other container starts making requests.
>>
>> Best,
>> Ke
