[
https://issues.apache.org/jira/browse/BEAM-4271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kenneth Knowles updated BEAM-4271:
----------------------------------
This Jira ticket has a pull request attached to it, but is still open. Did the
pull request resolve the issue? If so, could you please mark it resolved? This
will help the project have a clear view of its open issues.
> Executable stages allow side input coders to be set and/or queried
> ------------------------------------------------------------------
>
> Key: BEAM-4271
> URL: https://issues.apache.org/jira/browse/BEAM-4271
> Project: Beam
> Issue Type: Bug
> Components: runner-core
> Reporter: Ben Sidhom
> Priority: P3
> Labels: portability
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> ProcessBundleDescriptors may contain side input references from inner
> PTransforms. These side inputs do not have explicit coders; instead, SDK
> harnesses use the PCollection coders by default.
> Using the default PCollection coder as specified at pipeline construction is
> in general not the correct thing to do. When PCollection elements are
> materialized, any coders unknown to a runner a length-prefixed. This means
> that materialized PCollections do not use their original element coders. Side
> inputs are delivered to SDKs via MultimapSideInput StateRequests. The
> responses to these requests are expected to contain all of the values for a
> given key (and window), coded with the PCollection KV.value coder,
> concatenated. However, at the time of serving these requests on the runner
> side, we do not have enough information to reconstruct the original value
> coders.
> There are different ways to address this issue. For example:
> * Modify the associated PCollection coder to match the coder that the runner
> uses to materialize elements. This means that anywhere a given PCollection is
> used within a given bundle, it will use the runner-safe coder. This may
> introduce inefficiencies but should be "correct".
> * Annotate side inputs with explicit coders. This guarantees that the key
> and value coders used by the runner match the coders used by SDKs.
> Furthermore, it allows the _runners_ to specify coders. This involves changes
> to the proto models and all SDKs.
> * Annotate side input state requests with both key and value coders. This
> inverts the expected responsibility and has the SDK determine runner coders.
> Additionally, because runners do not understand all SDK types, additional
> coder substitution will need to be done at request handling time to make sure
> that the requested coder can be instantiated and will remain consistent with
> the SDK coder. This requires only small changes to SDKs because they may opt
> to use their default PCollection coders.
> All of the these approaches have their own downsides. Explicit side input
> coders is probably the right thing to do long-term, but the simplest change
> for now is to modify PCollection coders to match exactly how they're
> materialized.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)