[ 
https://issues.apache.org/jira/browse/FLINK-23493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450814#comment-17450814
 ] 

Huang Xingbo commented on FLINK-23493:
--------------------------------------

I'm very sorry; something went wrong with my mail client, so I did not receive the mail in time. Let me share the current progress with you.

The reasons for these hanging tests are the same. Let me first walk through the start-up steps of the operator that runs the Python UDF, which helps explain why the process hangs:
    step 1: The Java operator starts a gRPC server.
    step 2: The Java operator starts a child process, a shell script called `pyflink-udf-runner.sh`.
    step 3: The script `pyflink-udf-runner.sh` starts a child process, `python beam_boot.py`.
    step 4: `beam_boot.py` in turn starts a child process, `python beam_sdk_worker_main.py`.
    step 5: `beam_sdk_worker_main.py` starts a gRPC client that connects to the gRPC server running in the Java operator (step 1). A minimal sketch of steps 4 and 5 is shown below.
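
To make the chain in steps 4 and 5 easier to picture, here is a minimal, hypothetical sketch of the pattern (not the actual PyFlink/Beam code): a boot process spawns a worker as a child process, and the worker connects a gRPC client back to the server started by the Java operator. `SERVER_ADDRESS` and `worker_main.py` are placeholders for illustration only.

```python
import subprocess
import sys

import grpc

# In reality the address is handed down by the Java operator; this is a placeholder.
SERVER_ADDRESS = "localhost:50051"


def boot():
    # Step 4: the boot process starts the worker as a child process.
    # If this child never starts (or starts twice), the Java server
    # will never see a connection from this subtask.
    worker = subprocess.Popen([sys.executable, "worker_main.py", SERVER_ADDRESS])
    worker.wait()


def worker_main(address):
    # Step 5: the worker opens a gRPC channel to the Java operator's server
    # and blocks until it is connected; if this point is never reached,
    # the Java side keeps waiting.
    channel = grpc.insecure_channel(address)
    grpc.channel_ready_future(channel).result(timeout=60)
    # ... register with the server and start processing bundles ...


if __name__ == "__main__":
    boot()
```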

The hang manifests as follows: when we set a higher parallelism, one of the Java operators has a gRPC server that never receives the connection from the gRPC client on the Python side. By inspecting the background processes, we can see the following phenomena:
    1. `pyflink-udf-runner.sh` hangs while running `python beam_boot.py`, which is explainable.
    2. The process `python beam_sdk_worker_main.py` is not running, and from the log we can confirm that the gRPC client never started, which explains why the Java side hangs.
    3. According to the printed log, the execution of `python beam_boot.py` has finished, but this cannot explain why `python beam_sdk_worker_main.py` did not start successfully.
    4. Observing the processes running in the background, `python beam_boot.py` is still running.
    5. Two `python beam_boot.py` processes appear in the background, and one of them is a child process of the other (see the process-inspection sketch below).
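
The duplicate `python beam_boot.py` processes in phenomena 4 and 5 can be spotted from the process table. Below is a rough sketch using the third-party `psutil` package; the actual investigation may simply have used `ps`/`pstree` on the Azure machine.

```python
import psutil

# List every process whose command line mentions the Beam boot or worker
# scripts, together with its pid and parent pid. If one `beam_boot.py`
# process has another `beam_boot.py` process as its parent, that matches
# phenomenon 5.
for proc in psutil.process_iter(["pid", "ppid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "beam_boot.py" in cmdline or "beam_sdk_worker_main.py" in cmdline:
        print(proc.info["pid"], proc.info["ppid"], cmdline)
```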

At present, phenomena 3, 4, and 5 cannot be explained clearly, so I have added code that dumps the stack of the Python process in order to print out the state of `python beam_boot.py` (a possible way to do this is sketched below). However, because the hang is not stably reproducible, each experiment needs to trigger multiple Azure builds at the same time, so it takes a lot of time to get the result of a single experiment.
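
For reference, here is a minimal sketch of the kind of stack-dump hook that could be added to `beam_boot.py`; this is an assumed approach using the standard-library `faulthandler` module, not necessarily the exact code added for the experiment.

```python
import faulthandler
import signal

# Install once at process start-up. Afterwards, running `kill -USR1 <pid>`
# against the hanging process prints the Python stack of all its threads to
# stderr, which shows where `beam_boot.py` is stuck.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump the stacks automatically if the process is still alive
# after a timeout (here 300 seconds), without any external signal.
faulthandler.dump_traceback_later(300, exit=False)
```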







> python tests hang on Azure
> --------------------------
>
>                 Key: FLINK-23493
>                 URL: https://issues.apache.org/jira/browse/FLINK-23493
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Python
>    Affects Versions: 1.14.0, 1.13.1, 1.12.4, 1.15.0
>            Reporter: Dawid Wysakowicz
>            Assignee: Huang Xingbo
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.15.0, 1.14.1
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20898&view=logs&j=821b528f-1eed-5598-a3b4-7f748b13f261&t=4fad9527-b9a5-5015-1b70-8356e5c91490&l=22829



