Re: [PR] Preserve pane index through reshuffle. [beam]

via GitHub Mon, 31 Mar 2025 12:35:21 -0700


lostluck commented on code in PR #34348:
URL: https://github.com/apache/beam/pull/34348#discussion_r2021658318



##########
sdks/go/pkg/beam/runners/prism/internal/urns/urns.go:
##########
@@ -124,6 +124,7 @@ var (
        CoderTimer              = cdrUrn(pipepb.StandardCoders_TIMER)
 
        CoderKV                  = cdrUrn(pipepb.StandardCoders_KV)
+       CoderTuple               = "beam:coder:tuple:v1"

Review Comment:
   It's as Claude says.
   
   Basically the Python Tuple coder is an outlier: It pretends to be a standard 
beam coder, with an arbitrary number of components, and Python plays fast and 
loose with the notion of coder types. No other "custom coder" uses exposed sub 
components, essentially. Custom coders are usually fully opaque.
   
   The problem here is I tried to avoid needing to enumerate all Beam coders 
with sub components that needed processing. We already had a "leaf" list, why 
do we also need a "composite" list?  That means there are two approaches:
   
   1. This current PR's approach: Promote the janky python approach to be a 
known thing for all runners/SDKs. Since the URN is already pretending to be a 
standard URN, this isn't too hard, and it permits other SDKs to interoperate 
with that coder. AKA, turning the exception to be part of the standard.
   
   2. We are forced to specify known composite coders to avoid Length Prefixing 
them unnecessarily.
   So instead of *just* the set of Known Leaf Coders, we would have the set of 
Known Composite Coders, that don't need length prefixing. Anything else should 
just be length prefixed.
   
   Eg. We add a list of the known Composite URNs that should not be length 
prefixed by the
   
   Existing Leaf list:  
   
https://github.com/apache/beam/blob/f1bc5091330e9faeeab0066a4df61fa421ba1b90/sdks/go/pkg/beam/runners/prism/internal/coders.go#L38
 
   
   Where the check should go, so it's the same logic for wrapping unknowns. eg. 
https://github.com/apache/beam/blob/f1bc5091330e9faeeab0066a4df61fa421ba1b90/sdks/go/pkg/beam/runners/prism/internal/coders.go#L129
   
   `if (len(c.GetComponentCoderIds()) == 0 && !leaf) || 
!isKnownCompositeCoder(c)  {`
   
   This has two risks though:
   
   1. Changed length prefixing behavior, may mean tests that are currently 
passing will fail. It'll be important to run the Java suite locally (I don't 
trust the Github action to run uncached when it's just a prism side change. 
Only noticed that after I got re-orged. If it takes less than 20m to run, it 
was cached and can't be trusted).
   
   2. The converse issue: What if the Python SDK doesn't know how to deal with 
a runner side wrapped length prefixed tuple coder? Then a Python fix would be 
needed.  This would hopefully be evident in the test suite uses of tuple coder.
   
   (There's a issue with Java Row coders failing deep in the Java SDK when 
they're wrapped in a length prefix. The introspection doesn't know how to skip 
the LP wrapper (see #32931).
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@beam.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] Preserve pane index through reshuffle. [beam]

Reply via email to