[
https://issues.apache.org/jira/browse/BEAM-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422758#comment-17422758
]
Jens Wiren commented on BEAM-11959:
-----------------------------------
[~tvalentyn]
So I have tried the first point you mention, in three ways:
* running a simple Go snippet locally that installs PyPI packages using pip
* the same, but inside the image apache/beam_python3.7_sdk:2.32.0, which is the
exact one used in our k8s setup
* exec'ing into the running Beam worker pod (where pip has hung) and running
the same installation
All of these complete successfully.
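For anyone who wants to repeat this kind of check, here is a rough Python analogue of such a test (the Go snippet itself is not included in this comment). It runs pip as a child process with a time limit so that a hang shows up as a timeout instead of blocking forever. The helper names `pip_install_command` and `run_with_timeout` are made up for this sketch, and it is not how the boot binary invokes pip.

```python
import subprocess
import sys


def pip_install_command(package, python=sys.executable):
    """Build a verbose pip invocation for one package or local artifact.

    Hypothetical helper: the flags are ordinary pip options, not the ones
    the Beam boot binary necessarily uses.
    """
    return [python, "-m", "pip", "install", "--verbose", package]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (timed_out, returncode, output).

    stdout and stderr are merged so the last lines before a hang are kept.
    On timeout the child is killed and whatever was captured is returned.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        # The partial output may arrive as bytes even when text=True.
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return True, None, partial
    return False, proc.returncode, proc.stdout
```

Calling `run_with_timeout(pip_install_command("tfx_ephemeral-0.27.0.tar.gz"), 600)` would then report either a return code or a timeout together with the last captured output; the 600 second limit is an arbitrary choice.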
However, it did highlight that pip does not actually complete when it is run by
the boot binary in the Beam worker pod. It hangs on the very final steps of the
installation, when it does some cleanup. This is the output when it hangs in
the k8s pod:
{code:java}
adding 'model_pipeline-0.0.0.dist-info/METADATA'
adding 'model_pipeline-0.0.0.dist-info/WHEEL'
adding 'model_pipeline-0.0.0.dist-info/top_level.txt'
adding 'model_pipeline-0.0.0.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel
Running command python setup.py egg_info
running egg_info
creating /tmp/pip-pip-egg-info-_vxd45ww/model_pipeline.egg-info
writing /tmp/pip-pip-egg-info-_vxd45ww/model_pipeline.egg-info/PKG-INFO
writing dependency_links to
/tmp/pip-pip-egg-info-_vxd45ww/model_pipeline.egg-info/dependency_links.txt
writing requirements to
/tmp/pip-pip-egg-info-_vxd45ww/model_pipeline.egg-info/requires.txt
writing top-level names to
/tmp/pip-pip-egg-info-_vxd45ww/model_pipeline.egg-info/top_level.txt
writing manifest file
'/tmp/pip-pip-egg-info-_vxd45ww/model_pipeline.egg-info/SOURCES.txt'
reading manifest file
'/tmp/pip-pip-egg-info-_vxd45ww/model_pipeline.egg-info/SOURCES.txt'
writing manifest file
'/tmp/pip-pip-egg-info-_vxd45ww/model_pipeline.egg-info/SOURCES.txt'
{code}
and when running in the very same image locally:
{code:java}
adding 'model_pipeline-0.0.0.dist-info/METADATA'
adding 'model_pipeline-0.0.0.dist-info/WHEEL'
adding 'model_pipeline-0.0.0.dist-info/top_level.txt'
adding 'model_pipeline-0.0.0.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel
done
Created wheel for model-pipeline:
filename=model_pipeline-0.0.0-py3-none-any.whl size=137418
sha256=338fea9a5210cfba40eeefdfd498f13364f6d5d136a4788c8a12d8dc3d5b1c2e
Stored in directory:
/root/.cache/pip/wheels/9f/1a/e2/893f3f472147030e04d53da509ed47d77065aabb5bd4a949f1
Successfully built model-pipeline
Installing collected packages: model-pipeline
Successfully installed model-pipeline-0.0.0
Removed build tracker: '/tmp/pip-req-tracker-2z3y7ch4'
{code}
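To make the point of divergence explicit, the two outputs can be compared line by line. The sketch below uses only a short excerpt of each log quoted above; `first_divergence` is a hypothetical helper written for this illustration.

```python
from itertools import zip_longest

# Excerpts of the two outputs quoted above, starting at the last shared lines.
POD_OUTPUT = [
    "adding 'model_pipeline-0.0.0.dist-info/RECORD'",
    "removing build/bdist.linux-x86_64/wheel",
    "Running command python setup.py egg_info",
    "running egg_info",
]
LOCAL_OUTPUT = [
    "adding 'model_pipeline-0.0.0.dist-info/RECORD'",
    "removing build/bdist.linux-x86_64/wheel",
    "done",
    "Created wheel for model-pipeline:",
]


def first_divergence(lines_a, lines_b):
    """Return (index, line_a, line_b) for the first differing line, or None."""
    for index, (a, b) in enumerate(zip_longest(lines_a, lines_b)):
        if a != b:
            return index, a, b
    return None


print(first_divergence(POD_OUTPUT, LOCAL_OUTPUT))
# prints (2, 'Running command python setup.py egg_info', 'done')
```

In the pod, the line right after the wheel build directory is removed is the extra `egg_info` invocation, whereas locally the build reports `done`.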
It seems that an additional command is somehow run in the k8s pod, namely
{code:java}
Running command python setup.py egg_info{code}
and this is the step that hangs, for a reason I have not yet identified.
> Python Beam SDK Harness hangs when installing pip packages
> ----------------------------------------------------------
>
> Key: BEAM-11959
> URL: https://issues.apache.org/jira/browse/BEAM-11959
> Project: Beam
> Issue Type: Bug
> Components: runner-flink, sdk-py-harness
> Affects Versions: 2.27.0, 2.28.0, 2.31.0, 2.32.0
> Environment: Kubernetes v1.20.1
> Reporter: Jens Wiren
> Priority: P1
> Attachments: jobmanager-configmap.yaml, jobmanager-deploy.yaml,
> jobmanager-svc.yaml, taskmanager-deploy.yaml
>
>
> When running a Beam pipeline using Flink as the backend, the Python SDK harness
> hangs when trying to install pip packages. Tested using Flink 1.10.3.
> Images used:
> apache/beam_python3.7_sdk:2.28.0
> apache/flink:1.10.3
> Beam args used are:
> "--runner=FlinkRunner",
> "--flink_version=1.10", //same with 1.13
> "--flink_master=http://flink-jobmanager.default:8081",
> f"--artifacts_dir=/mnt/flink",
> "--environment_type=EXTERNAL",
> "--environment_config=localhost:50000",
>
> Specifically this was tested by running a TFX pipeline which gets submitted
> and registered as it should, but the SDK Harness hangs when installing:
> 2021/03/10 12:16:20 Initializing python harness: /opt/apache/beam/boot
> --id=1-1 --logging_endpoint=localhost:39795
> --artifact_endpoint=localhost:34095 --provision_endpoint=localhost:42999
> --control_endpoint=localhost:38129
> 2021/03/10 12:16:20 Found artifact: tfx_ephemeral-0.27.0.tar.gz
> 2021/03/10 12:16:20 Found artifact: extra_packages.txt
> 2021/03/10 12:16:20 Installing setup packages ...
> 2021/03/10 12:16:20 Installing extra package: tfx_ephemeral-0.27.0.tar.gz
> and nothing else is shown regardless of how long it is left. I can manually
> install the TFX package by exec'ing into the container in < 3 min.
> The Flink task-manager then sits idle and periodically logs:
> 2021-03-10 11:29:26,287 INFO
> org.apache.beam.runners.fnexecution.environment.ExternalEnvironmentFactory -
> Still waiting for startup of environment from localhost:50000 for worker id
> 1-1
> Helm charts attached below.