kennknowles opened a new issue, #18941:
URL: https://github.com/apache/beam/issues/18941

   ProcessBundleDescriptors may contain side input references from inner 
PTransforms. These side inputs do not have explicit coders; instead, SDK 
harnesses use the PCollection coders by default.
   
   Using the default PCollection coder as specified at pipeline construction is 
in general not the correct thing to do. When PCollection elements are 
materialized, any coders unknown to a runner a length-prefixed. This means that 
materialized PCollections do not use their original element coders. Side inputs 
are delivered to SDKs via MultimapSideInput StateRequests. The responses to 
these requests are expected to contain all of the values for a given key (and 
window), coded with the PCollection KV.value coder, concatenated. However, at 
the time of serving these requests on the runner side, we do not have enough 
information to reconstruct the original value coders.
   
   There are different ways to address this issue. For example:
    * Modify the associated PCollection coder to match the coder that the 
runner uses to materialize elements. This means that anywhere a given 
PCollection is used within a given bundle, it will use the runner-safe coder. 
This may introduce inefficiencies but should be "correct".
    * Annotate side inputs with explicit coders. This guarantees that the key 
and value coders used by the runner match the coders used by SDKs. Furthermore, 
it allows the _runners_ to specify coders. This involves changes to the proto 
models and all SDKs.
    * Annotate side input state requests with both key and value coders. This 
inverts the expected responsibility and has the SDK determine runner coders. 
Additionally, because runners do not understand all SDK types, additional coder 
substitution will need to be done at request handling time to make sure that 
the requested coder can be instantiated and will remain consistent with the SDK 
coder. This requires only small changes to SDKs because they may opt to use 
their default PCollection coders.
   
   All of the these approaches have their own downsides. Explicit side input 
coders is probably the right thing to do long-term, but the simplest change for 
now is to modify PCollection coders to match exactly how they're materialized.
   
   Imported from Jira 
[BEAM-4271](https://issues.apache.org/jira/browse/BEAM-4271). Original Jira may 
contain additional context.
   Reported by: bsidhom.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to