I think the Dataflow update concern is the biggest concern and as you point
out that can be easily overcome. I generally believe that users who aren't
setting the coder don't actually care as long as it works, so changing the
default across Beam versions seems relatively low risk. Another mitigating
factor is that both concerns require users to actually be using the coder
(likely via Reshuffle.viaRandomKey) rather than consuming the output in a
fused ParDo (which I think is what we would recommend).

As a counter proposal: is there something that stops us from using RowCoder
by default here? I assume all forms of "PubsubMessage" can be encoded with
RowCoder, it provides flexibility for future changes, and PubSub will be
able to take advantage of future work to make RowCoder more efficient. (If
we can't switch to RowCoder, that seems like a useful feature request for
RowCoder.)

Andrew

On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin <egal...@apache.org> wrote:

> Bump 🙃
>
> Any other risks or drawbacks associated with altering the default coder
> for PubsubMessage to be the most inclusive coder with respect to possible
> fields?
>
> Thanks,
> Evan
>
>
> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin <egal...@apache.org> wrote:
>
>> Hi folks,
>>
>> I'd like to solicit feedback on the notion of using
>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
>> default coder for Pubsub messages instead of the current default of
>> PubsubMessageWithAttributesCoder.
>>
>> Not long ago, support for reading and writing Pubsub messages in Beam
>> including an OrderingKey was added[2].  Part of this change involved adding
>> a new Coder for PubsubMessage in order to capture and propagate the
>> orderingKey[1].  This change illuminated that in cases where the coder type
>> for PubsubMessage is inferred, it is possible to accidentally and silently
>> nullify fields like MessageId and OrderingKey in a way that is not at all
>> obvious to users[3].
>>
>> So far two potential drawbacks of this proposal have been identified:
>> 1. Update compatibility for pipelines using PubsubIO might require users
>> to explicitly specify the current default coder (
>> PubsubMessageWithAttributesCoder)
>> 2. Messages would require a larger number of bytes to store as compared
>> to the current default (which could again be overcome by users specifying
>> the current default coder)
>>
>> What other potential drawbacks might there be? I look forward to hearing
>> others' input!
>>
>> Thanks,
>> Evan
>>
>> [1]
>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
>> [2] https://github.com/apache/beam/pull/22216
>> [3] https://github.com/apache/beam/issues/23525
>>
>

Reply via email to