I have a PR that makes GBK a primitive, and with it the test_combine_globally
<https://github.com/apache/beam/blob/10dc1bb683aa9c219397cb3474b676a4fbac5a0e/sdks/python/apache_beam/transforms/validate_runner_xlang_test.py#L162>
test is failing on the DataflowRunner. In particular, the DataflowRunner runs
over the transforms in its run_pipeline method. I moved the check that
verifies that coders used as inputs to GBKs are deterministic into
run_pipeline. Previously, this check ran in apply_GroupByKey.
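
A minimal sketch of the kind of check described above, assuming stand-in
classes rather than the real apache_beam API. Every name here (Coder,
OpaqueExternalCoder, AppliedTransform, find_nondeterministic_gbk_inputs) is
hypothetical and only illustrates how a coder that cannot be reconstructed,
and so falls back to "not deterministic", would trip a check that runs over
all transforms:

```python
from dataclasses import dataclass
from typing import List


class Coder:
    """Stand-in for an SDK coder (hypothetical, not the Beam class)."""

    def is_deterministic(self) -> bool:
        return False


class DeterministicCoder(Coder):
    """Stand-in for a coder the SDK understands and knows is deterministic."""

    def is_deterministic(self) -> bool:
        return True


class OpaqueExternalCoder(Coder):
    """Stand-in for a shim around a coder the SDK could not recreate.

    With no determinism metadata available, it inherits the False default.
    """


@dataclass
class AppliedTransform:
    name: str
    is_group_by_key: bool
    input_coder: Coder


def find_nondeterministic_gbk_inputs(
        transforms: List[AppliedTransform]) -> List[str]:
    """Return the names of GBK transforms whose input coder is not deterministic."""
    return [
        t.name for t in transforms
        if t.is_group_by_key and not t.input_coder.is_deterministic()
    ]


transforms = [
    AppliedTransform("Map", False, OpaqueExternalCoder()),
    AppliedTransform("GBK_python", True, DeterministicCoder()),
    AppliedTransform("GBK_after_external", True, OpaqueExternalCoder()),
]
failing = find_nondeterministic_gbk_inputs(transforms)
```

In this sketch only the GBK fed by the opaque coder is reported, even if the
underlying coder in the other SDK is in fact deterministic.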

On Tue, May 19, 2020 at 4:48 PM Brian Hulette <bhule...@google.com> wrote:

> Yes, I'm unclear on how a PCollection with ExternalCoder made it into a
> downstream transform that enforces is_deterministic. My understanding of
> ExternalCoder (admittedly just based on a quick look at commit history) is
> that it's a shim added so the Python SDK can handle coders that are
> internal to cross-language transforms.
> I think that if the Python SDK is trying to introspect an ExternalCoder
> instance then something is wrong.
>
> Brian
>
> On Tue, May 19, 2020 at 4:01 PM Luke Cwik <lc...@google.com> wrote:
>
>> I see. The problem is that you are trying to know certain properties of
>> the coder to use in a downstream transform which enforces that it is
>> deterministic like GroupByKey.
>>
>> In all the scenarios I have seen so far, we have required both SDKs to
>> understand the coder. How do you have a cross-language pipeline that works
>> when the downstream SDK doesn't understand the coder?
>>
>> Also, an alternative strategy would be to tell the expansion service that
>> you need it to choose a deterministic coder for the output. This would
>> require building the pipeline and then, before submission to the job server,
>> performing the expansion while telling the expansion service all the
>> limitations that the SDK has imposed on it.
>>
>> On Tue, May 19, 2020 at 3:45 PM Sam Rohde <sro...@google.com> wrote:
>>
>>> Hi all,
>>>
>>> Should there be more metadata in the Coder proto? For example, adding an
>>> "is_deterministic" boolean field. This would give SDKs a language-agnostic
>>> way to infer properties about a coder received from the expansion
>>> service.
>>>
>>> My motivation for this is that I recently ran into a problem in which an
>>> "ExternalCoder" in the Python SDK was erroneously marked as
>>> non-deterministic. The reason is that the Coder proto doesn't have an
>>> "is_deterministic" field, so when the coder fails to be recreated in
>>> Python, the ExternalCoder defaults to False.
>>>
>>> Regards,
>>> Sam
>>>
>>>
