phoerious edited a comment on pull request #16658: URL: https://github.com/apache/beam/pull/16658#issuecomment-1047144100
@ryanthompson591 @tvalentyn I updated the PR. The venvs now use random names and are bound to their workers, which is the only way to make this safe.

I also fixed how workers are cleaned up. Previously, the worker pool Python executable simply SIGKILLed them, which prevented any kind of cleanup and also left zombie processes inside the containers. I think there are still some cases where processes are not cleaned up properly and keep running forever, but most of that should be fixed now.

These runaway processes show up particularly when I'm using a global CombineFn, which causes Flink to believe that the last remaining worker is still running even though it has long since finished. When that happens, not even cancelling the job sends signals to the remaining workers. But that's a separate bug (I reported it on the mailing list before, but never got a response).

All of this needs some more testing, but it seems to be running fine on my Flink cluster at least.
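For illustration, here is a minimal sketch of the cleanup pattern described above. It is not the code from this PR: `start_worker` and `stop_worker_and_cleanup` are hypothetical names, and the randomly named directory only stands in for a per-worker venv (no venv is actually created in it). The idea is to send SIGTERM first so the worker can clean up, escalate to SIGKILL only after a grace period, always `wait()` on the child so it does not linger as a zombie, and remove the directory that belongs to that worker.

```python
import shutil
import subprocess
import sys
import tempfile


def start_worker(cmd):
    """Start a worker process with its own randomly named directory.

    The directory is a stand-in for a per-worker venv (assumption: the
    real environment would be created inside it).
    """
    env_dir = tempfile.mkdtemp(prefix="beam-venv-")
    proc = subprocess.Popen(cmd)
    return proc, env_dir


def stop_worker_and_cleanup(proc, env_dir, grace_seconds=5.0):
    """Stop the worker gracefully, reap it, and delete its directory."""
    if proc.poll() is None:
        # SIGTERM on POSIX: gives the worker a chance to run its own cleanup.
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            # Last resort only, after the grace period has expired.
            proc.kill()
    # wait() reaps the child, so it does not remain as a zombie process.
    exit_code = proc.wait()
    # The directory belongs to exactly one worker, so it is safe to remove.
    shutil.rmtree(env_dir, ignore_errors=True)
    return exit_code


if __name__ == "__main__":
    worker, worker_dir = start_worker(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print("worker exited with", stop_worker_and_cleanup(worker, worker_dir))
```

Because each worker owns its directory, removing it on shutdown cannot affect another worker, which is the reason for binding the environment to the worker in the first place.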
