[
https://issues.apache.org/jira/browse/BEAM-14153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514282#comment-17514282
]
Robert Burke commented on BEAM-14153:
-------------------------------------
Now that I've got a chance to look at this, the right move is to serialize the
coder used for encodings/decodings with the reshuffle transforms. This bug, and
the similar one with CoGBK inject both stem from this.
So the fix would be to add a payload for reshuflle (and extend the one for
inject) that will contain the serialized coder protos, and use those for
encoding/decoding the types, instead of pulling the coders from the PTransforms.
A check should be added to ensure that the element types of the incoming and
outgoing PCollections match the decoded elements, and this will avoid the issue
around recoding the types, leading to the bug.
The construct is also one of the rare cases we could add an optimized step in
the SDK to *avoid* doing any decoding/encoding in the stage if the types of the
received encoded values from the DataSource are the same as what needs to be
sent to the DataSink (modulo the reshuffle key) are the same and no user dofns
are in the way. That would be a different JIRA ticket though.
> Reshuffled Row Coder PCollection used direct to Side Input breaks Dataflow &
> PyPortable
> ---------------------------------------------------------------------------------------
>
> Key: BEAM-14153
> URL: https://issues.apache.org/jira/browse/BEAM-14153
> Project: Beam
> Issue Type: Bug
> Components: sdk-go
> Affects Versions: 2.37.0, 2.38.0
> Reporter: Robert Burke
> Assignee: Robert Burke
> Priority: P2
> Fix For: 2.39.0
>
>
> Since First class Iterable side inputs were implemented, passing a reshuffled
> PCollection directly to a Side Input will cause a coder mismatch between
> encoding the reshuffle and decoding it on Dataflow and on Python Portable. In
> particular, the Row values will be encoded without a Length Prefix, but then
> be requested to decode them with a length prefix, which wasn't included.
> This is similar to the issue in BEAM-12438 which has been hacked around.
> In this instance it's likely more resilient to always length prefix Row
> encoded types, and make it explicit in the pipeline proto. This should avoid
> issues with runners having odd behaviors WRT row coders at this time, while
> not preventing them from introspecting row encoded values should they chose.
> This may also allow us to avoid the hack for BEAM-12438, though that is
> something to be verified independently.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)