[ https://issues.apache.org/jira/browse/BEAM-10762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kyle Weaver reassigned BEAM-10762:
----------------------------------
Assignee: Kyle Weaver
> Beam Python on Flink fails when no artifacts staged
> ---------------------------------------------------
>
> Key: BEAM-10762
> URL: https://issues.apache.org/jira/browse/BEAM-10762
> Project: Beam
> Issue Type: Bug
> Components: java-fn-execution, sdk-py-harness
> Affects Versions: 2.23.0
> Reporter: Charles Chen
> Assignee: Kyle Weaver
> Priority: P1
>
> When a Beam job with no artifacts staged is deployed on a Beam-on-Flink
> cluster (without the JobServer), it crashes the Python Beam worker pool,
> which does not recover. This causes that job (and subsequent jobs that would
> have used that task manager slot) to hang and fail. Strangely, if a Beam job
> *with* artifacts staged is first run on that Beam worker pool container
> instance (i.e. using that task manager slot), subsequent jobs *without*
> artifacts staged that land on the same task manager slot run properly and
> succeed. When this happens, the error from the worker pool container looks
> like this:
> {code:java}
> 2020-08-19 01:58:12.287 PDT
> 2020/08/19 08:58:12 Initializing python harness: /opt/apache/beam/boot
> --id=1-1 --logging_endpoint=localhost:38839
> --artifact_endpoint=localhost:43187 --provision_endpoint=localhost:37259
> --control_endpoint=localhost:34931
> 2020-08-19 01:58:12.305 PDT
> 2020/08/19 08:58:12 Failed to retrieve staged files: failed to get manifest
> 2020-08-19 01:58:12.305 PDT
> caused by:
> 2020-08-19 01:58:12.305 PDT
> rpc error: code = Unimplemented desc = Method not found:
> org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
> {code}
> To reproduce this with Beam 2.23, it is sufficient to run the wordcount
> example modified to set the PipelineOptions flag `save_main_session=False`:
> {code:java}
> python -m apache_beam.examples.wordcount \
>     --runner=FlinkRunner \
>     --flink_master=$FLINK_MASTER_HOST:8081 \
>     --flink_submit_uber_jar \
>     --environment_type=EXTERNAL \
>     --environment_config=localhost:50000 \
>     --input gs://dataflow-samples/shakespeare/kinglear.txt \
>     --output gs://my/bucket/location
> {code}
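> For reference, here is the same reproduction expressed as a minimal Python
> sketch rather than the actual wordcount source; the Flink master address and
> output path are placeholders. Setting `save_main_session=False` means no
> pickled main-session artifact is staged for the job.
> {code:python}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
>
> # Placeholder endpoints/paths; substitute your own Flink master and bucket.
> options = PipelineOptions(
>     runner='FlinkRunner',
>     flink_master='localhost:8081',
>     flink_submit_uber_jar=True,
>     environment_type='EXTERNAL',
>     environment_config='localhost:50000',
>     # No main session is pickled, so no artifacts are staged for this job.
>     save_main_session=False,
> )
>
> with beam.Pipeline(options=options) as p:
>     (p
>      | 'Read' >> beam.io.ReadFromText(
>          'gs://dataflow-samples/shakespeare/kinglear.txt')
>      | 'Split' >> beam.FlatMap(lambda line: line.split())
>      | 'Count' >> beam.combiners.Count.PerElement()
>      | 'Format' >> beam.MapTuple(lambda word, n: '%s: %d' % (word, n))
>      | 'Write' >> beam.io.WriteToText('gs://my/bucket/location'))
> {code}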
> This was tested in a Beam-on-Flink cluster deployed on Kubernetes
> ([https://github.com/GoogleCloudPlatform/flink-on-k8s-operator]) with a
> version of this patch modified to use Beam 2.23
> ([https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/301],
> i.e. with instances of "2.22" replaced with "2.23").
--
This message was sent by Atlassian Jira
(v8.3.4#803005)