[ https://issues.apache.org/jira/browse/FLINK-23493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450814#comment-17450814 ]
Huang Xingbo commented on FLINK-23493:
--------------------------------------

I'm very sorry for the late reply. Something was wrong with my mail client, so the mail was not received in time. Let me share the current progress.

All of these hanging tests have the same cause. Let me first describe the start-up steps of the operator that runs the Python UDF, which helps explain why the process hangs:

step 1: The Java operator starts a gRPC server.
step 2: The Java operator starts a child process, a shell script called `pyflink-udf-runner.sh`.
step 3: The script `pyflink-udf-runner.sh` starts a child process `python beam_boot.py`.
step 4: `beam_boot.py` in turn starts a child process `python beam_sdk_worker_main.py`.
step 5: `beam_sdk_worker_main.py` starts a gRPC client that connects to the gRPC server running in the Java operator (step 1).

The hang looks like this: when the parallelism is greater than one, the gRPC server of one of the parallel Java operators never receives the connection from the gRPC client on the Python side. Inspecting the background processes shows the following:

1. `pyflink-udf-runner.sh` hangs while running `python beam_boot.py`, which can be explained.
2. The process `python beam_sdk_worker_main.py` is not running, and the log confirms that the gRPC client never started, which explains why the Java side hangs.
3. According to the printed log, the execution of `python beam_boot.py` has finished, but this does not explain why `python beam_sdk_worker_main.py` did not start successfully.
4. In the list of background processes, `python beam_boot.py` is still running.
5. Two `python beam_boot.py` processes appear in the background, and one of them is a child process of the other.

At present, observations 3, 4, and 5 cannot be clearly explained.
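
As an aside, observation 5 (one `python beam_boot.py` being the child of another) can be checked mechanically from a process listing. The following is only a minimal sketch, not code from Flink or Beam: it assumes a listing in the format of `ps -eo pid,ppid,args`, and the sample PIDs, the `java` command line, and the helper names `parse_ps` and `nested_matches` are all hypothetical.

```python
# Hypothetical sample of `ps -eo pid,ppid,args` output. The PIDs and the
# java command line are made up for illustration only.
SAMPLE_PS = """\
  PID  PPID COMMAND
 4100     1 java org.example.TaskManager
 4200  4100 /bin/sh pyflink-udf-runner.sh
 4201  4200 python beam_boot.py
 4202  4201 python beam_boot.py
"""


def parse_ps(text):
    """Return {pid: (ppid, command)} parsed from `ps -eo pid,ppid,args` text."""
    procs = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        # Skip the header and any line that does not start with two integers.
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        pid, ppid, cmd = parts
        procs[int(pid)] = (int(ppid), cmd.strip())
    return procs


def nested_matches(procs, needle):
    """Return (parent_pid, child_pid) pairs where both commands contain needle."""
    pairs = []
    for pid, (ppid, cmd) in sorted(procs.items()):
        parent = procs.get(ppid)
        if needle in cmd and parent is not None and needle in parent[1]:
            pairs.append((ppid, pid))
    return pairs


if __name__ == "__main__":
    processes = parse_ps(SAMPLE_PS)
    for parent_pid, child_pid in nested_matches(processes, "beam_boot.py"):
        print(f"beam_boot.py pid {child_pid} is a child of beam_boot.py pid {parent_pid}")
```

On a real machine the text would come from running `ps` on the hanging worker instead of `SAMPLE_PS`. The sketch only reports the parent/child relationship; it says nothing about why the second process exists.
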
To investigate further, I have added code that prints the call stack of the Python process, so that we can see the state of `python beam_boot.py`. However, the hang does not reproduce reliably, so each experiment has to trigger multiple Azure Labs at the same time, and it takes a lot of time to get the results of one experiment.

> python tests hang on Azure
> --------------------------
>
>                 Key: FLINK-23493
>                 URL: https://issues.apache.org/jira/browse/FLINK-23493
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Python
>    Affects Versions: 1.14.0, 1.13.1, 1.12.4, 1.15.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Huang Xingbo
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.15.0, 1.14.1
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20898&view=logs&j=821b528f-1eed-5598-a3b4-7f748b13f261&t=4fad9527-b9a5-5015-1b70-8356e5c91490&l=22829

--
This message was sent by Atlassian Jira
(v8.20.1#820001)