Yichi Zhang created BEAM-11154:
----------------------------------
Summary: Missing coder in pipeline components with dataflow runner v2
Key: BEAM-11154
URL: https://issues.apache.org/jira/browse/BEAM-11154
Project: Beam
Issue Type: Bug
Components: runner-dataflow
Reporter: Yichi Zhang
Assignee: Yichi Zhang
When running pipelines that use the Top combine function on Dataflow runner v2,
the backend complains about a missing coder id, for example a missing
BoundedHeapCoder1.
After some troubleshooting, the problem seems more generic:
The step context translation phase does not recognize an already registered
Coder whose hashCode() function is incorrect, and tries to assign it a new,
uniquified name as its pipeline_proto_coder_id.
Code pointer:
https://github.com/apache/beam/blob/5675108933de6eb601ca2e4f21870d2ababe0ec7/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SdkComponents.java#L268
In this case, since the comparator field in BoundedHeapCoder often does not
implement hashCode() and equals(), the BoundedHeapCoder also has a different
hashCode() each time a new instance is created. The duplicated coder does not
exist in the already translated pipeline proto, which leads to the
aforementioned missing coder id issue.
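
The failure mode described above can be reproduced without Beam. The following
is only a minimal sketch under stated assumptions: HeapCoderLike, Registry,
IdentityComparator and ValueComparator are hypothetical stand-ins invented for
illustration, not Beam classes, and the Registry only imitates the idea of
looking up an already registered coder by equals()/hashCode() before handing
out a new id. It is not the actual SdkComponents implementation.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class CoderIdSketch {

  /** Comparator that inherits identity-based equals()/hashCode() from Object. */
  public static final class IdentityComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
      return a.compareTo(b);
    }
  }

  /** Comparator with value-based equals()/hashCode(): all instances are equal. */
  public static final class ValueComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
      return a.compareTo(b);
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof ValueComparator;
    }

    @Override
    public int hashCode() {
      return 17; // constant is fine because all instances are equal
    }
  }

  /** Hypothetical stand-in for a coder that wraps a comparator. */
  public static final class HeapCoderLike {
    private final int maxSize;
    private final Comparator<String> comparator;

    public HeapCoderLike(int maxSize, Comparator<String> comparator) {
      this.maxSize = maxSize;
      this.comparator = comparator;
    }

    // Equality delegates to the comparator, so it is only as good as the
    // comparator's own equals()/hashCode().
    @Override
    public boolean equals(Object o) {
      if (!(o instanceof HeapCoderLike)) {
        return false;
      }
      HeapCoderLike other = (HeapCoderLike) o;
      return maxSize == other.maxSize && comparator.equals(other.comparator);
    }

    @Override
    public int hashCode() {
      return Objects.hash(maxSize, comparator);
    }
  }

  /** Hypothetical registry: reuses an id for an equal coder, else creates one. */
  public static final class Registry {
    private final Map<HeapCoderLike, String> ids = new HashMap<>();

    public String register(HeapCoderLike coder) {
      String existing = ids.get(coder);
      if (existing != null) {
        return existing;
      }
      String id = "HeapCoderLike" + ids.size();
      ids.put(coder, id);
      return id;
    }

    public int size() {
      return ids.size();
    }
  }

  public static void main(String[] args) {
    Registry broken = new Registry();
    String first = broken.register(new HeapCoderLike(10, new IdentityComparator()));
    String second = broken.register(new HeapCoderLike(10, new IdentityComparator()));
    System.out.println("identity comparator: " + first + ", " + second);

    Registry fixed = new Registry();
    String third = fixed.register(new HeapCoderLike(10, new ValueComparator()));
    String fourth = fixed.register(new HeapCoderLike(10, new ValueComparator()));
    System.out.println("value comparator: " + third + ", " + fourth);
  }
}
```

In this sketch, two logically identical coders built with IdentityComparator
are registered under two different ids, while the ones built with
ValueComparator share a single id. This suggests one possible direction for a
fix: give the comparator value-based equals() and hashCode().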