[
https://issues.apache.org/jira/browse/BEAM-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Burke updated BEAM-11916:
--------------------------------
Resolution: Information Provided
Status: Resolved (was: Open)
> Combine failed on large PCollection of uint64 arrays
> ----------------------------------------------------
>
> Key: BEAM-11916
> URL: https://issues.apache.org/jira/browse/BEAM-11916
> Project: Beam
> Issue Type: Bug
> Components: sdk-go
> Affects Versions: 2.28.0
> Environment: Google Dataflow
> Reporter: Tao Liao
> Priority: P3
> Labels: GCP
> Attachments: dataflow autoscaling.png
>
>
> We came across an issue with the Combine operation in the Apache Beam Go SDK
> (v2.28.0) when running a pipeline on Google Cloud Dataflow. Source code:
> https://github.com/le0000000/dataflow_combine
> We understand that the Go SDK is experimental, but it would be great if
> someone could help us understand whether there is anything wrong with our
> code, or whether there is a bug in the Go SDK or Dataflow. The issue only
> happens when running the pipeline on Google Dataflow with a large data set.
> We are trying to combine a PCollection<pairedVec>, where
>     type pairedVec struct {
>         Vec1 [1048576]uint64
>         Vec2 [1048576]uint64
>     }
> There are 10,000,000 items in the PCollection. After reading the input file,
> Dataflow scheduled 1000 workers to generate the PCollection and then started
> the combine step. The number of workers then dropped to almost 1 and stayed
> there for a very long time. Eventually the job failed with the following
> error log:
> 2021-03-02T06:13:40.438112597Z Workflow failed. Causes:
> S09:CombinePerKey/CoGBK'1/Read+CombinePerKey/main.combineVecFn+CombinePerKey/main.combineVecFn/Extract+beam.dropKeyFn+main.flattenVecFn+textio.Write/beam.addFixedKeyFn+textio.Write/CoGBK/Write
> failed., The job failed because a work item has failed 4 times. Look in
> previous log entries for the cause of each one of the 4 failures. For more
> information, see https://cloud.google.com/dataflow/docs/guides/common-errors.
> The work item was attempted on these workers:
> go-job-1-1614659244459204-03012027-u5s6-harness-q8tx Root cause: The worker
> lost contact with the service.,
> go-job-1-1614659244459204-03012027-u5s6-harness-44hk Root cause: The worker
> lost contact with the service.,
> go-job-1-1614659244459204-03012027-u5s6-harness-05nm Root cause: The worker
> lost contact with the service.,
> go-job-1-1614659244459204-03012027-u5s6-harness-l22w Root cause: The worker
> lost contact with the service.
>
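For a sense of scale, the data volume implied by the report can be worked out from the quoted type and element count alone. The sketch below is only that arithmetic: pairedVec and the 10,000,000 count come from the description, while elementBytes and totalBytes are hypothetical helper names introduced here. It measures the raw in-memory size of the Go values; the encoded size that Beam or Dataflow actually moves between workers may differ, and nothing here diagnoses the failure.

```go
package main

import (
	"fmt"
	"unsafe"
)

// pairedVec mirrors the element type quoted in the issue description.
type pairedVec struct {
	Vec1 [1048576]uint64
	Vec2 [1048576]uint64
}

// elementBytes returns the in-memory size of a single pairedVec value.
// (Hypothetical helper; not part of the reporter's code.)
func elementBytes() uint64 {
	// unsafe.Sizeof is evaluated at compile time; no 16 MiB value is allocated.
	return uint64(unsafe.Sizeof(pairedVec{}))
}

// totalBytes returns the raw in-memory footprint of n pairedVec values,
// ignoring any coder or transport overhead.
// (Hypothetical helper; not part of the reporter's code.)
func totalBytes(n uint64) uint64 {
	return n * elementBytes()
}

func main() {
	const numElements = 10_000_000 // element count stated in the report

	fmt.Printf("one element:  %d bytes (%d MiB)\n",
		elementBytes(), elementBytes()>>20)
	fmt.Printf("all elements: %d bytes (about %.1f TB)\n",
		totalBytes(numElements), float64(totalBytes(numElements))/1e12)
}
```

Each element is two arrays of 1048576 eight-byte integers, so 16 MiB per value; at 10,000,000 elements that is roughly 168 TB of raw data entering the combine.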