[ 
https://issues.apache.org/jira/browse/BEAM-4271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-4271:
----------------------------------

This Jira ticket has a pull request attached to it, but is still open. Did the 
pull request resolve the issue? If so, could you please mark it resolved? This 
will help the project have a clear view of its open issues.

> Executable stages allow side input coders to be set and/or queried
> ------------------------------------------------------------------
>
>                 Key: BEAM-4271
>                 URL: https://issues.apache.org/jira/browse/BEAM-4271
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-core
>            Reporter: Ben Sidhom
>            Priority: P3
>              Labels: portability
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> ProcessBundleDescriptors may contain side input references from inner 
> PTransforms. These side inputs do not have explicit coders; instead, SDK 
> harnesses use the PCollection coders by default.
> Using the default PCollection coder as specified at pipeline construction is 
> in general not the correct thing to do. When PCollection elements are 
> materialized, any coders unknown to a runner are length-prefixed. This means 
> that materialized PCollections do not use their original element coders. Side 
> inputs are delivered to SDKs via MultimapSideInput StateRequests. The 
> responses to these requests are expected to contain all of the values for a 
> given key (and window), coded with the PCollection KV.value coder, 
> concatenated. However, at the time of serving these requests on the runner 
> side, we do not have enough information to reconstruct the original value 
> coders.
> There are different ways to address this issue. For example:
>  * Modify the associated PCollection coder to match the coder that the runner 
> uses to materialize elements. This means that anywhere a given PCollection is 
> used within a given bundle, it will use the runner-safe coder. This may 
> introduce inefficiencies but should be "correct".
>  * Annotate side inputs with explicit coders. This guarantees that the key 
> and value coders used by the runner match the coders used by SDKs. 
> Furthermore, it allows the _runners_ to specify coders. This involves changes 
> to the proto models and all SDKs.
>  * Annotate side input state requests with both key and value coders. This 
> inverts the expected responsibility and has the SDK determine runner coders. 
> Additionally, because runners do not understand all SDK types, additional 
> coder substitution will need to be done at request handling time to make sure 
> that the requested coder can be instantiated and will remain consistent with 
> the SDK coder. This requires only small changes to SDKs because they may opt 
> to use their default PCollection coders.
> All of these approaches have their own downsides. Explicit side input 
> coders are probably the right thing to do long-term, but the simplest change 
> for now is to modify PCollection coders to match exactly how they're 
> materialized.
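
The coder mismatch described above can be illustrated with a minimal,
self-contained sketch. It does not use Beam's APIs: sdk_encode is a
hypothetical stand-in for an SDK-specific coder the runner cannot
instantiate, and the varint length prefix is an assumption about how a
runner might frame such opaque values. All helper names are illustrative.

```python
from typing import List, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.append(bits | 0x80)  # more bytes follow
        else:
            out.append(bits)
            return bytes(out)


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def sdk_encode(value: str) -> bytes:
    """Hypothetical SDK-only coder; the runner sees its output as opaque bytes."""
    return value.encode("utf-8")


def length_prefix(payload: bytes) -> bytes:
    """Frame opaque bytes so the runner can find element boundaries."""
    return encode_varint(len(payload)) + payload


def split_length_prefixed(stream: bytes) -> List[bytes]:
    """Recover the individual payloads from a concatenated, framed stream."""
    payloads = []
    pos = 0
    while pos < len(stream):
        size, pos = decode_varint(stream, pos)
        payloads.append(stream[pos:pos + size])
        pos += size
    return payloads


values = ["a", "bcd"]

# What the SDK expects if it assumes the original PCollection value coder.
expected_by_sdk = b"".join(sdk_encode(v) for v in values)

# What the runner actually holds after materializing with length prefixes.
materialized = b"".join(length_prefix(sdk_encode(v)) for v in values)

# The byte streams differ, so decoding with the original coder would be wrong.
mismatch = expected_by_sdk != materialized

# If both sides agree on the length-prefixed coder, the values round-trip.
recovered = [p.decode("utf-8") for p in split_length_prefixed(materialized)]
```

Under these assumptions, the first approach listed above corresponds to
having the SDK read the side input with the same length-prefixed framing
the runner used when materializing, which is what the last line does.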



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
