Jake Zuliani created BEAM-13795:
-----------------------------------

             Summary: `beam.CombineValues` on Dataflow runner causes ambiguous failure with Python SDK
                 Key: BEAM-13795
                 URL: https://issues.apache.org/jira/browse/BEAM-13795
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow, sdk-py-core
    Affects Versions: 2.35.0
         Environment: Can provide Dockerfile, pyproject.toml, poetry.lock files 
on request.
Using Apache Beam 2.35.0 with GCP extras, on Python 3.8.10.
            Reporter: Jake Zuliani


 

The following Beam pipeline works correctly on the `DirectRunner` but fails with a very vague error when run on the `DataflowRunner`.
{code:python}
(
    pipeline
    | beam.io.ReadFromPubSub(input_topic, with_attributes=True)
    | beam.Map(pubsub_message_to_row)
    | beam.WindowInto(beam.transforms.window.FixedWindows(5))
    | beam.GroupBy(<beam.Row col name>)
    | beam.CombineValues(<instance of beam.CombineFn subclass>)
    | beam.Values()
    | beam.io.gcp.bigquery.WriteToBigQuery( . . . )
){code}
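
To make the two redacted placeholders above concrete, here is a minimal, self-contained sketch of the same `GroupBy` + `CombineValues` shape. The `sensor_id` column and the `MeanFn` combiner are hypothetical names invented for illustration (with `beam.Create` and `print` standing in for the Pub/Sub source and the BigQuery sink); they are not the actual code from the failing job. A shape like this should run as-is on the `DirectRunner`:
{code:python}
# Hypothetical stand-in for the redacted pieces of the pipeline above.
# The Row schema, the grouping column ("sensor_id") and MeanFn are invented
# for illustration only.
import apache_beam as beam


class MeanFn(beam.CombineFn):
    """Averages the `value` field of the rows grouped under each key."""

    def create_accumulator(self):
        return 0.0, 0  # (running sum, element count)

    def add_input(self, accumulator, row):
        total, count = accumulator
        return total + row.value, count + 1

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


with beam.Pipeline() as pipeline:  # defaults to the DirectRunner
    (
        pipeline
        | beam.Create([
            beam.Row(sensor_id="a", value=1.0),
            beam.Row(sensor_id="a", value=3.0),
            beam.Row(sensor_id="b", value=5.0),
        ])
        | beam.GroupBy("sensor_id")         # -> (key, Iterable[Row])
        | beam.CombineValues(MeanFn())      # -> (key, mean of row.value)
        | beam.Values()
        | beam.Map(print)                   # stand-in for WriteToBigQuery(...)
    )
{code}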
Stacktrace:
{code:java}
Traceback (most recent call last):
  File "src/read_quality_pipeline/__init__.py", line 128, in <module>
    (
  File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/pipeline.py", line 597, in __exit__
    self.result.wait_until_finish()
  File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1633, in wait_until_finish
    raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Error processing pipeline. {code}
Log output:
{code:java}
2022-02-01T16:54:43.645Z: JOB_MESSAGE_WARNING: Autoscaling is enabled for Dataflow Streaming Engine. Workers will scale between 1 and 100 unless maxNumWorkers is specified.
2022-02-01T16:54:43.736Z: JOB_MESSAGE_DETAILED: Autoscaling is enabled for job 2022-02-01_08_54_40-8791019287477103665. The number of workers will be between 1 and 100.
2022-02-01T16:54:43.757Z: JOB_MESSAGE_DETAILED: Autoscaling was automatically enabled for job 2022-02-01_08_54_40-8791019287477103665.
2022-02-01T16:54:44.624Z: JOB_MESSAGE_ERROR: Error processing pipeline. {code}
With the `CombineValues` step removed, this pipeline starts successfully on Dataflow.
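
For comparison, the same per-key aggregation can also be expressed with `beam.CombinePerKey` instead of `beam.GroupBy` + `beam.CombineValues`. The sketch below reuses the hypothetical `MeanFn` / `sensor_id` names from the snippet above and is untested on Dataflow; it is included only to make the comparison concrete, not as a confirmed workaround.
{code:python}
# Untested comparison sketch: the same per-key mean written with
# beam.CombinePerKey. Assumes the hypothetical MeanFn class (and
# `import apache_beam as beam`) from the sketch above.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([
            beam.Row(sensor_id="a", value=1.0),
            beam.Row(sensor_id="a", value=3.0),
            beam.Row(sensor_id="b", value=5.0),
        ])
        | beam.Map(lambda row: (row.sensor_id, row))  # explicit (key, row) pairs
        | beam.CombinePerKey(MeanFn())  # GroupByKey + per-key combine in one transform
        | beam.Values()
        | beam.Map(print)
    )
{code}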

 

I initially thought this was a server-side issue with Dataflow, since the Dataflow API (v1b3.projects.locations.jobs.messages) just returns the textPayload "Error processing pipeline". But then I found BEAM-12636, where a Go SDK user hit the same error message, seemingly as a result of bugs in the Go SDK, so perhaps the cause here is likewise in the SDK rather than on the server side?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
