[
https://issues.apache.org/jira/browse/BEAM-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486652#comment-17486652
]
Kenneth Knowles commented on BEAM-13795:
----------------------------------------
This looks like an internal Dataflow error. Can you run the pipeline on another
runner?
> `beam.CombineValues` on Dataflow runner causes ambiguous failure with Python
> SDK
> --------------------------------------------------------------------------------
>
> Key: BEAM-13795
> URL: https://issues.apache.org/jira/browse/BEAM-13795
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow, sdk-py-core
> Affects Versions: 2.35.0
> Environment: Can provide Dockerfile, pyproject.toml, poetry.lock
> files on request.
> Using Apache Beam 2.35.0 with GCP extras, on Python 3.8.10.
> Reporter: Jake Zuliani
> Priority: P2
> Labels: GCP, newbie
>
>
> The following beam pipeline works correctly using `DirectRunner` but fails
> with a very vague error when using `DataflowRunner`.
> {code:java}
> (
>     pipeline
>     | beam.io.ReadFromPubSub(input_topic, with_attributes=True)
>     | beam.Map(pubsub_message_to_row)
>     | beam.WindowInto(beam.transforms.window.FixedWindows(5))
>     | beam.GroupBy(<beam.Row col name>)
>     | beam.CombineValues(<instance of beam.CombineFn subclass>)
>     | beam.Values()
>     | beam.io.gcp.bigquery.WriteToBigQuery( . . . )
> ){code}
> Stacktrace:
> {code:java}
> Traceback (most recent call last):
>   File "src/read_quality_pipeline/__init__.py", line 128, in <module>
>     (
>   File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/pipeline.py", line 597, in __exit__
>     self.result.wait_until_finish()
>   File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1633, in wait_until_finish
>     raise DataflowRuntimeException(
> apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
> Error processing pipeline. {code}
> Log output:
> {code:java}
> 2022-02-01T16:54:43.645Z: JOB_MESSAGE_WARNING: Autoscaling is enabled for Dataflow Streaming Engine. Workers will scale between 1 and 100 unless maxNumWorkers is specified.
> 2022-02-01T16:54:43.736Z: JOB_MESSAGE_DETAILED: Autoscaling is enabled for job 2022-02-01_08_54_40-8791019287477103665. The number of workers will be between 1 and 100.
> 2022-02-01T16:54:43.757Z: JOB_MESSAGE_DETAILED: Autoscaling was automatically enabled for job 2022-02-01_08_54_40-8791019287477103665.
> 2022-02-01T16:54:44.624Z: JOB_MESSAGE_ERROR: Error processing pipeline. {code}
> With the `CombineValues` step removed, this pipeline starts successfully on
> Dataflow.
>
> I initially thought this was a server-side Dataflow issue, since the
> Dataflow API (v1b3.projects.locations.jobs.messages) returns only the
> textPayload "Error processing pipeline". However, I then found BEAM-12636,
> where a Go SDK user hits the same error message, but seemingly as a result
> of bugs in the Go SDK?
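
For readers unfamiliar with the step that triggers the failure, here is a
rough sketch of what `beam.CombineValues` does to each key's grouped values.
It deliberately does not import Beam: `MeanFn`, `group_by_key` and
`combine_values` are hypothetical stand-ins, and only the four method names
mirror Beam's `CombineFn` interface. The reporter's actual `CombineFn`
subclass is not shown in the issue, so the mean computation is an assumption
chosen purely for illustration.

```python
from collections import defaultdict


class MeanFn:
    """Hypothetical combiner; method names mirror Beam's CombineFn interface."""

    def create_accumulator(self):
        # Running (sum, count) pair for one key.
        return (0.0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        # A runner may combine partial results computed on different workers.
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def group_by_key(pairs):
    """Stand-in for the grouping step: (key, value) pairs -> {key: [values]}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def combine_values(grouped, combine_fn):
    """Stand-in for CombineValues: reduce each key's values with combine_fn."""
    result = {}
    for key, values in grouped.items():
        accumulator = combine_fn.create_accumulator()
        for value in values:
            accumulator = combine_fn.add_input(accumulator, value)
        result[key] = combine_fn.extract_output(accumulator)
    return result


if __name__ == "__main__":
    rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
    print(combine_values(group_by_key(rows), MeanFn()))
```

This only illustrates the per-key contract; it says nothing about why the
Dataflow service rejects the job.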
--
This message was sent by Atlassian Jira
(v8.20.1#820001)