[ 
https://issues.apache.org/jira/browse/BEAM-10762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Weaver reassigned BEAM-10762:
----------------------------------

    Assignee: Kyle Weaver

> Beam Python on Flink fails when no artifacts staged
> ---------------------------------------------------
>
>                 Key: BEAM-10762
>                 URL: https://issues.apache.org/jira/browse/BEAM-10762
>             Project: Beam
>          Issue Type: Bug
>          Components: java-fn-execution, sdk-py-harness
>    Affects Versions: 2.23.0
>            Reporter: Charles Chen
>            Assignee: Kyle Weaver
>            Priority: P1
>
> When a Beam job with no artifacts staged is deployed on a Beam-on-Flink 
> cluster (without the JobServer), it crashes the Python beam worker pool and 
> does not recover.  This causes that job (and subsequent jobs that would have 
> used that task manager slot) to hang and fail.  Strangely, if a Beam job 
> *with* artifacts staged is run on that Beam worker pool container instance 
> (i.e. using that task manager slot), subsequent jobs *without* artifacts 
> staged that end up using that task manager slot run properly and succeed.
> When this happens, the error from the worker pool container looks like this:
> {code:java}
> 2020-08-19 01:58:12.287 PDT
> 2020/08/19 08:58:12 Initializing python harness: /opt/apache/beam/boot 
> --id=1-1 --logging_endpoint=localhost:38839 
> --artifact_endpoint=localhost:43187 --provision_endpoint=localhost:37259 
> --control_endpoint=localhost:34931
> 2020-08-19 01:58:12.305 PDT
> 2020/08/19 08:58:12 Failed to retrieve staged files: failed to get manifest
> 2020-08-19 01:58:12.305 PDT
> caused by:
> 2020-08-19 01:58:12.305 PDT
> rpc error: code = Unimplemented desc = Method not found: 
> org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
> {code}
> To reproduce this, it is sufficient to run the wordcount example modified to 
> set the PipelineOptions flag `save_main_session=False` with Beam 2.23.
> {code:java}
> python -m apache_beam.examples.wordcount --runner=FlinkRunner 
> --flink_master=$FLINK_MASTER_HOST:8081 --flink_submit_uber_jar 
> --environment_type=EXTERNAL --environment_config=localhost:50000 --input 
> gs://dataflow-samples/shakespeare/kinglear.txt --output 
> gs://my/bucket/location
> {code}
> This was tested in a Beam-on-Flink cluster deployed on Kubernetes 
> ([https://github.com/GoogleCloudPlatform/flink-on-k8s-operator]) with a 
> modified version of this patch to use Beam 2.23 
> ([https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/301], 
> i.e. with instances of "2.22" replaced with "2.23").



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to