[ 
https://issues.apache.org/jira/browse/BEAM-12792?focusedWorklogId=730487&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-730487
 ]

ASF GitHub Bot logged work on BEAM-12792:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 21/Feb/22 18:36
            Start Date: 21/Feb/22 18:36
    Worklog Time Spent: 10m 
      Work Description: phoerious commented on pull request #16658:
URL: https://github.com/apache/beam/pull/16658#issuecomment-1047144100


   @ryanthompson591 @tvalentyn I updated the PR. The venvs now use random 
names and are bound to individual workers, which is the only way to make this safe.
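
   The per-worker random naming could look roughly like the sketch below. This is a hypothetical illustration, not the actual PR code; the helper names and directory layout are assumptions. The point is that a random suffix per worker means two jobs on the same Flink session cluster can never end up sharing (and therefore reusing) the same environment path.

   ```python
   import os
   import subprocess
   import sys
   import uuid

   def worker_venv_path(base_dir, worker_id):
       # Random suffix so concurrent jobs on the same session cluster
       # never collide on (and silently reuse) the same venv path.
       return os.path.join(base_dir, f"beam-venv-{worker_id}-{uuid.uuid4().hex}")

   def create_worker_venv(base_dir, worker_id):
       # Create a uniquely named venv for a single worker; each worker
       # installs its own job's packages into its own environment.
       venv_dir = worker_venv_path(base_dir, worker_id)
       subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)
       return venv_dir
   ```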
   
   I also fixed how workers are cleaned up. Previously, the worker pool Python 
executable simply SIGKILL'ed them, which prevented any kind of cleanup and also 
left zombie processes inside the containers. I think there are still some cases 
where processes are not cleaned up properly and keep running forever, but most 
of those should be fixed now. Processes keep running forever particularly when 
I use a global CombineFn, which makes Flink believe the last remaining worker is 
still running even though it finished long ago. When that happens, not even 
cancelling the job sends signals to the remaining workers. But that's another 
bug (I think I reported it on the mailing list before, but never got a response).
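
   The cleanup change amounts to replacing a blind SIGKILL with a terminate-then-kill sequence that always reaps the child. A minimal sketch of that pattern follows; this is an illustration of the general approach, not the actual PR code, and the function name and grace period are assumptions.

   ```python
   import subprocess
   import sys

   def stop_worker(proc, grace_seconds=5.0):
       # SIGTERM first, so the worker can run its cleanup handlers
       # (removing its venv, closing connections, etc.).
       proc.terminate()
       try:
           proc.wait(timeout=grace_seconds)
       except subprocess.TimeoutExpired:
           # Fall back to SIGKILL only if the worker ignores SIGTERM.
           proc.kill()
           # Always wait() so the child is reaped and never lingers
           # as a zombie inside the container.
           proc.wait()
       return proc.returncode
   ```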
   
   All of this needs more testing, but at least it seems to run fine on my 
Flink cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 730487)
    Time Spent: 5h 50m  (was: 5h 40m)

> Multiple jobs running on Flink session cluster reuse the persistent Python 
> environment.
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-12792
>                 URL: https://issues.apache.org/jira/browse/BEAM-12792
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-harness
>    Affects Versions: 2.27.0, 2.28.0, 2.29.0, 2.30.0, 2.31.0
>         Environment: Kubernetes 1.20 on Ubuntu 18.04.
>            Reporter: Jens Wiren
>            Priority: P1
>              Labels: FlinkRunner, beam
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> I'm running TFX pipelines on a Flink cluster using Beam in k8s. However, 
> extra python packages passed to the Flink runner (or rather beam worker 
> side-car) are only installed once per deployment cycle. Example:
>  # Flink is deployed and is up and running
>  # A TFX pipeline starts, submits a job to Flink along with a python whl of 
> custom code and beam ops.
>  # The beam worker installs the package and the pipeline finishes successfully.
>  # A new TFX pipeline is built in which a new beam fn is introduced; the 
> pipeline is started and the new whl is submitted as in step 2).
>  # This time, the new package is not installed in the beam worker, causing 
> the job to fail on a reference that does not exist there, since the worker 
> never installed the new package.
>  
> I have been using Flink since beam version 2.27, and this has been an issue 
> the whole time.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
