I have a PR that makes GBK a primitive, and with it test_combine_globally <https://github.com/apache/beam/blob/10dc1bb683aa9c219397cb3474b676a4fbac5a0e/sdks/python/apache_beam/transforms/validate_runner_xlang_test.py#L162> fails on the DataflowRunner. In particular, the DataflowRunner runs over the transform in its run_pipeline method. As part of the PR, I moved the check that verifies that the coders on the inputs to GBKs are deterministic into run_pipeline. Previously, that check ran in apply_GroupByKey.
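
To make the failure mode concrete, here is a minimal sketch in plain Python. It is not the actual Beam code: the classes Coder, VarIntCoder and ExternalCoder below are simplified stand-ins, and check_gbk_input_coder is a hypothetical name for the kind of determinism check described above. It only assumes what the thread states, namely that the shim coder reports itself as non-deterministic because the proto carries no such field.

```python
class Coder:
    """Simplified stand-in for an SDK coder (not Beam's real class)."""

    def is_deterministic(self):
        # Conservative default: assume nothing about the encoding.
        return False


class VarIntCoder(Coder):
    """Stand-in for a coder the Python SDK fully understands."""

    def is_deterministic(self):
        return True


class ExternalCoder(Coder):
    """Stand-in shim for a coder that could not be recreated in Python.

    The proto has no "is_deterministic" field, so the shim has nothing to
    read and inherits the conservative default of False.
    """

    def __init__(self, urn):
        self.urn = urn  # hypothetical: identifier of the foreign coder


def check_gbk_input_coder(key_coder):
    """Hypothetical check run on the key coder of a GBK input."""
    if not key_coder.is_deterministic():
        raise ValueError(
            "GroupByKey requires a deterministic key coder, got %s"
            % type(key_coder).__name__)
    return key_coder


if __name__ == "__main__":
    check_gbk_input_coder(VarIntCoder())  # passes
    try:
        check_gbk_input_coder(ExternalCoder("some:foreign:coder"))
    except ValueError as e:
        print("rejected:", e)
```

With this shape, the check rejects the shim even if the foreign coder is in fact deterministic, which is the behavior described in the original message quoted below.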

On Tue, May 19, 2020 at 4:48 PM Brian Hulette <bhule...@google.com> wrote:

> Yes I'm unclear on how a PCollection with ExternalCoder made it into a
> downstream transform that enforces is_deterministic. My understanding of
> ExternalCoder (admittedly just based on a quick look at commit history) is
> that it's a shim added so the Python SDK can handle coders that are
> internal to cross-language transforms.
> I think that if the Python SDK is trying to introspect an ExternalCoder
> instance then something is wrong.
>
> Brian
>
> On Tue, May 19, 2020 at 4:01 PM Luke Cwik <lc...@google.com> wrote:
>
>> I see. The problem is that you are trying to know certain properties of
>> the coder to use in a downstream transform which enforces that it is
>> deterministic like GroupByKey.
>>
>> In all the scenarios so far that I have seen we have required both SDKs
>> to understand the coder, how are you having a cross language pipeline where
>> the downstream SDK doesn't understand the coder and works?
>>
>> Also, an alternative strategy would be to tell the expansion service that
>> you need to choose a coder that is deterministic on the output. This would
>> require building the pipeline and before submission to the job server
>> perform the expansion telling it all the limitations that the SDK has
>> imposed on it.
>>
>> On Tue, May 19, 2020 at 3:45 PM Sam Rohde <sro...@google.com> wrote:
>>
>>> Hi all,
>>>
>>> Should there be more metadata in the Coder Proto? For example, adding an
>>> "is_deterministic" boolean field. This will allow for a language-agnostic
>>> way for SDKs to infer properties about a coder received from the expansion
>>> service.
>>>
>>> My motivation for this is that I recently ran into a problem in which an
>>> "ExternalCoder" in the Python SDK was erroneously marked as
>>> non-deterministic. The reason being is that the Coder proto doesn't have an
>>> "is_deterministic" and when the coder fails to be recreated in Python, the
>>> ExternalCoder defaults to False.
>>>
>>> Regards,
>>> Sam