[
https://issues.apache.org/jira/browse/BEAM-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422156#comment-17422156
]
Jens Wiren commented on BEAM-11959:
-----------------------------------
I managed to add some verbose flags to pip and compiled boot.go.
It seems the issue is actually in the Go exec call. Pip finishes the install
(if I pass a tar.gz, not a whl), but Go's cmd.Run() doesn't return.
If I kubectl exec into the container, I can see the package listed when
running pip freeze, which indicates that pip install actually finished
successfully. Go's cmd.Run() never returned, though, so the boot process keeps
waiting for a call that has already completed.
> Python Beam SDK Harness hangs when installing pip packages
> ----------------------------------------------------------
>
> Key: BEAM-11959
> URL: https://issues.apache.org/jira/browse/BEAM-11959
> Project: Beam
> Issue Type: Bug
> Components: runner-flink, sdk-py-harness
> Affects Versions: 2.27.0, 2.28.0, 2.31.0, 2.32.0
> Environment: Kubernetes v1.20.1
> Reporter: Jens Wiren
> Priority: P1
> Attachments: jobmanager-configmap.yaml, jobmanager-deploy.yaml,
> jobmanager-svc.yaml, taskmanager-deploy.yaml
>
>
> When running a Beam pipeline using Flink as the backend, the Python SDK harness
> hangs when trying to install pip packages. Tested using Flink 1.10.3.
> Images used:
> apache/beam_python3.7_sdk:2.28.0
> apache/flink:1.10.3
> Beam args used are:
> "--runner=FlinkRunner",
> "--flink_version=1.10", //same with 1.13
> "--flink_master=http://flink-jobmanager.default:8081",
> f"--artifacts_dir=/mnt/flink",
> "--environment_type=EXTERNAL",
> "--environment_config=localhost:50000",
>
> Specifically this was tested by running a TFX pipeline which gets submitted
> and registered as it should, but the SDK Harness hangs when installing:
> 2021/03/10 12:16:20 Initializing python harness: /opt/apache/beam/boot
> --id=1-1 --logging_endpoint=localhost:39795
> --artifact_endpoint=localhost:34095 --provision_endpoint=localhost:42999
> --control_endpoint=localhost:38129
> 2021/03/10 12:16:20 Found artifact: tfx_ephemeral-0.27.0.tar.gz
> 2021/03/10 12:16:20 Found artifact: extra_packages.txt
> 2021/03/10 12:16:20 Installing setup packages ...
> 2021/03/10 12:16:20 Installing extra package: tfx_ephemeral-0.27.0.tar.gz
> and nothing else is shown, regardless of how long it is left. I can manually
> install the TFX package by exec'ing into the container in < 3 min.
> The Flink task manager then sits idle and periodically logs:
> 2021-03-10 11:29:26,287 INFO
> org.apache.beam.runners.fnexecution.environment.ExternalEnvironmentFactory -
> Still waiting for startup of environment from localhost:50000 for worker id
> 1-1
> Helm charts attached below.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)