+1 to looking into using RowCoder, this may help avoid creating more
specialized coders in the future (which is mentioned as a pain point in the
issue you linked [1]).

[1] https://github.com/apache/beam/issues/23525#issuecomment-1281294275

On Tue, Dec 20, 2022 at 3:00 AM Andrew Pilloud via dev <dev@beam.apache.org>
wrote:

> I think the Dataflow update concern is the biggest concern and as you
> point out that can be easily overcome. I generally believe that users who
> aren't setting the coder don't actually care as long as it works, so
> changing the default across Beam versions seems relatively low
> risk. Another mitigating factor is that both concerns require users to
> actually be using the coder (likely via Reshuffle.viaRandomKey) rather than
> consuming the output in a fused ParDo (which I think is what we would
> recommend).
>
> As a counter proposal: is there something that stops us from using
> RowCoder by default here? I assume all forms of "PubsubMessage" can be
> encoded with RowCoder, it provides flexibility for future changes, and
> PubSub will be able to take advantage of future work to make RowCoder more
> efficient. (If we can't switch to RowCoder, that seems like a useful
> feature request for RowCoder.)
>
> Andrew
>
> On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin <egal...@apache.org> wrote:
>
>> Bump 🙃
>>
>> Any other risks or drawbacks associated with altering the default coder
>> for PubsubMessage to be the most inclusive coder with respect to possible
>> fields?
>>
>> Thanks,
>> Evan
>>
>>
>> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin <egal...@apache.org> wrote:
>>
>>> Hi folks,
>>>
>>> I'd like to solicit feedback on the notion of using
>>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
>>> default coder for Pubsub messages instead of the current default of
>>> PubsubMessageWithAttributesCoder.
>>>
>>> Not long ago, support for reading and writing Pubsub messages in Beam
>>> including an OrderingKey was added[2].  Part of this change involved adding
>>> a new Coder for PubsubMessage in order to capture and propagate the
>>> orderingKey[1].  This change illuminated that in cases where the coder type
>>> for PubsubMessage is inferred, it is possible to accidentally and silently
>>> nullify fields like MessageId and OrderingKey in a way that is not at all
>>> obvious to users[3].
>>>
>>> So far two potential drawbacks of this proposal have been identified:
>>> 1. Update compatibility for pipelines using PubsubIO might require users
>>> to explicitly specify the current default coder (
>>> PubsubMessageWithAttributesCoder)
>>> 2. Messages would require a larger number of bytes to store as compared
>>> to the current default (which could again be overcome by users specifying
>>> the current default coder)
>>>
>>> What other potential drawbacks might there be? I look forward to hearing
>>> others' input!
>>>
>>> Thanks,
>>> Evan
>>>
>>> [1]
>>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
>>> [2] https://github.com/apache/beam/pull/22216
>>> [3] https://github.com/apache/beam/issues/23525
>>>
>>

Reply via email to