[ 
https://issues.apache.org/jira/browse/BEAM-10762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Chen updated BEAM-10762:
--------------------------------
    Description: 
When a Beam job with no artifacts staged is deployed on a Beam-on-Flink cluster 
(without the JobServer), it crashes the Python Beam worker pool, which does not 
recover.  This causes that job (and subsequent jobs that would have used that 
task manager slot) to hang and fail.  Strangely, if a Beam job *with* artifacts 
staged is run on that worker pool container instance (i.e. using that task 
manager slot), subsequent jobs *without* artifacts staged that end up using 
that task manager slot run properly and succeed.

When this happens, the error from the worker pool container looks like this:
{code}
2020-08-19 01:58:12.287 PDT
2020/08/19 08:58:12 Initializing python harness: /opt/apache/beam/boot --id=1-1 
--logging_endpoint=localhost:38839 --artifact_endpoint=localhost:43187 
--provision_endpoint=localhost:37259 --control_endpoint=localhost:34931
2020-08-19 01:58:12.305 PDT
2020/08/19 08:58:12 Failed to retrieve staged files: failed to get manifest
2020-08-19 01:58:12.305 PDT
caused by:
2020-08-19 01:58:12.305 PDT
rpc error: code = Unimplemented desc = Method not found: 
org.apache.beam.model.job_management.v1.LegacyArtifactRetrievalService/GetManifest
{code}
To reproduce this with Beam 2.23, it is sufficient to run the wordcount example 
modified to set the PipelineOptions flag `save_main_session=False` (so that no 
pickled main session is staged as an artifact):
{code:bash}
python -m apache_beam.examples.wordcount \
  --runner=FlinkRunner --flink_master=$FLINK_MASTER_HOST:8081 --flink_submit_uber_jar \
  --environment_type=EXTERNAL --environment_config=localhost:50000 \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output gs://my/bucket/location
{code}
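Equivalently, the no-artifacts condition can be set up directly in Python rather 
than by editing the example. The sketch below is illustrative only: the Flink 
master address, worker pool endpoint, and the trivial Create/Map pipeline are 
placeholder assumptions, not taken from this report; the relevant part is 
`save_main_session=False`, which means the job stages no artifacts.
{code:python}
# Minimal sketch (hypothetical values) of the repro condition: with
# save_main_session=False and no extra dependencies, no artifacts are staged.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='FlinkRunner',
    flink_master='flink-jobmanager:8081',   # placeholder, e.g. $FLINK_MASTER_HOST:8081
    flink_submit_uber_jar=True,
    environment_type='EXTERNAL',
    environment_config='localhost:50000',   # address of the external worker pool
    save_main_session=False,                # the condition under test: nothing staged
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['this trivial pipeline stages no artifacts'])
     | beam.Map(print))
{code}
If the diagnosis above holds, submitting this against the same EXTERNAL worker 
pool should fail with the same GetManifest error as the wordcount repro.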
This was tested on a Beam-on-Flink cluster deployed on Kubernetes 
([https://github.com/GoogleCloudPlatform/flink-on-k8s-operator]) with a 
version of this patch modified to use Beam 2.23 
([https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/301], i.e. 
with instances of "2.22" replaced with "2.23").



> Beam Python on Flink fails when no artifacts staged
> ---------------------------------------------------
>
>                 Key: BEAM-10762
>                 URL: https://issues.apache.org/jira/browse/BEAM-10762
>             Project: Beam
>          Issue Type: Bug
>          Components: java-fn-execution, sdk-py-harness
>    Affects Versions: 2.23.0
>            Reporter: Charles Chen
>            Priority: P1
>


