Yes, I'm unclear on how a PCollection with an ExternalCoder made it into a downstream transform that enforces is_deterministic. My understanding of ExternalCoder (admittedly based only on a quick look at the commit history) is that it's a shim added so the Python SDK can handle coders that are internal to cross-language transforms. If the Python SDK is trying to introspect an ExternalCoder instance, then something is wrong.
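To make the failure mode concrete, here is a minimal sketch in plain Python. It does not import Beam: `Coder`, `VarIntCoder`, `OpaqueExternalCoder`, and `require_deterministic` are hypothetical stand-ins, and the URN string is made up. It only assumes what the thread describes, namely that a placeholder coder reports False for is_deterministic and that a GroupByKey-like transform rejects such a coder.

```python
class Coder:
    """Hypothetical stand-in for an SDK coder base class (not Beam's)."""

    def is_deterministic(self):
        # Conservative default: a coder the SDK knows nothing about
        # is treated as non-deterministic.
        return False


class VarIntCoder(Coder):
    """Stand-in for a coder the SDK fully understands."""

    def is_deterministic(self):
        return True


class OpaqueExternalCoder(Coder):
    """Stand-in for a shim around a coder proto the SDK could not recreate."""

    def __init__(self, urn):
        # The shim keeps only an identifier; it has no way to learn whether
        # the real coder is deterministic, so it inherits the False default.
        self.urn = urn


def require_deterministic(coder, step_label):
    """Mimic the check a GroupByKey-like transform applies to its key coder."""
    if not coder.is_deterministic():
        raise ValueError(
            "%s requires a deterministic key coder, got %s"
            % (step_label, type(coder).__name__))
    return coder


# A coder the SDK understands passes the check.
require_deterministic(VarIntCoder(), "GroupByKey")

# The opaque shim is rejected, even if the real coder behind it
# happens to be deterministic.
try:
    require_deterministic(OpaqueExternalCoder("example:coder:urn"), "GroupByKey")
except ValueError as err:
    print(err)
```

If the shim only ever lives inside the expanded cross-language transform, a check like this should never see it, which is why hitting it suggests something is wrong upstream.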
Brian

On Tue, May 19, 2020 at 4:01 PM Luke Cwik <lc...@google.com> wrote:

> I see. The problem is that you are trying to know certain properties of
> the coder to use in a downstream transform which enforces that it is
> deterministic like GroupByKey.
>
> In all the scenarios so far that I have seen we have required both SDKs to
> understand the coder, how are you having a cross language pipeline where
> the downstream SDK doesn't understand the coder and works?
>
> Also, an alternative strategy would be to tell the expansion service that
> you need to choose a coder that is deterministic on the output. This would
> require building the pipeline and before submission to the job server
> perform the expansion telling it all the limitations that the SDK has
> imposed on it.
>
> On Tue, May 19, 2020 at 3:45 PM Sam Rohde <sro...@google.com> wrote:
>
>> Hi all,
>>
>> Should there be more metadata in the Coder Proto? For example, adding an
>> "is_deterministic" boolean field. This will allow for a language-agnostic
>> way for SDKs to infer properties about a coder received from the expansion
>> service.
>>
>> My motivation for this is that I recently ran into a problem in which an
>> "ExternalCoder" in the Python SDK was erroneously marked as
>> non-deterministic. The reason being is that the Coder proto doesn't have an
>> "is_deterministic" and when the coder fails to be recreated in Python, the
>> ExternalCoder defaults to False.
>>
>> Regards,
>> Sam