Re: [Proposal] Change to Default PubsubMessage Coder

2022-12-19 Thread Robert Burke
Technically, since the PubSubMessage is just a protocol buffer, I'm a bit
surprised we aren't using its native encoding under the hood.

The Go SDK accepts either byte slices (which it uses to populate a message's
data field) or the proto representation, which it publishes as-is.

Similarly for reads, either the data field is extracted as bytes, or the
whole PubSubMessage.

Granted, part of this is just because the coder setup in the Go SDK isn't
as malleable as in Java or Python, and the types can't be re-coded between
transforms.
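
The two input shapes described above, and the matching read side, can be
sketched roughly as follows. This is a toy model in Python, not the Go SDK's
actual API: Message, to_message and from_message are hypothetical names
introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Message:
    """Toy stand-in for a Pub/Sub message (hypothetical, not a real SDK type)."""
    data: bytes = b""
    attributes: Dict[str, str] = field(default_factory=dict)
    ordering_key: str = ""


def to_message(element: Union[bytes, Message]) -> Message:
    """Write side: normalize either accepted input shape into a full message."""
    if isinstance(element, Message):
        # Already a full message: publish it unchanged.
        return element
    if isinstance(element, (bytes, bytearray)):
        # Raw bytes: wrap them as the data field of a new message.
        return Message(data=bytes(element))
    raise TypeError(f"unsupported element type: {type(element).__name__}")


def from_message(msg: Message, want_bytes: bool) -> Union[bytes, Message]:
    """Read side: hand back only the payload, or the whole message."""
    return msg.data if want_bytes else msg
```

In this sketch the caller decides which shape it wants; nothing is
re-encoded between steps, which matches the "not recodable" caveat above.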





Beam Go Busybody
Robert Burke

On Mon, Dec 19, 2022, 4:18 PM Ahmed Abualsaud via dev 
wrote:

> +1 to looking into using RowCoder, this may help avoid creating more
> specialized coders in the future (which is mentioned as a pain point in the
> issue you linked [1]).
>
> [1] https://github.com/apache/beam/issues/23525#issuecomment-1281294275
>
> On Tue, Dec 20, 2022 at 3:00 AM Andrew Pilloud via dev <
> dev@beam.apache.org> wrote:
>
>> I think the Dataflow update concern is the biggest concern and as you
>> point out that can be easily overcome. I generally believe that users who
>> aren't setting the coder don't actually care as long as it works, so
>> changing the default across Beam versions seems relatively low
>> risk. Another mitigating factor is that both concerns require users to
>> actually be using the coder (likely via Reshuffle.viaRandomKey) rather than
>> consuming the output in a fused ParDo (which I think is what we would
>> recommend).
>>
>> As a counter proposal: is there something that stops us from using
>> RowCoder by default here? I assume all forms of "PubsubMessage" can be
>> encoded with RowCoder, it provides flexibility for future changes, and
>> PubSub will be able to take advantage of future work to make RowCoder more
>> efficient. (If we can't switch to RowCoder, that seems like a useful
>> feature request for RowCoder.)
>>
>> Andrew
>>
>> On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin  wrote:
>>
>>> Bump 
>>>
>>> Any other risks or drawbacks associated with altering the default coder
>>> for PubsubMessage to be the most inclusive coder with respect to possible
>>> fields?
>>>
>>> Thanks,
>>> Evan
>>>
>>>
>>> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin  wrote:
>>>
>>>> Hi folks,
>>>>
>>>> I'd like to solicit feedback on the notion of using
>>>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
>>>> default coder for Pubsub messages instead of the current default of
>>>> PubsubMessageWithAttributesCoder.
>>>>
>>>> Not long ago, support for reading and writing Pubsub messages in Beam
>>>> including an OrderingKey was added[2].  Part of this change involved adding
>>>> a new Coder for PubsubMessage in order to capture and propagate the
>>>> orderingKey[1].  This change illuminated that in cases where the coder type
>>>> for PubsubMessage is inferred, it is possible to accidentally and silently
>>>> nullify fields like MessageId and OrderingKey in a way that is not at all
>>>> obvious to users[3].
>>>>
>>>> So far two potential drawbacks of this proposal have been identified:
>>>> 1. Update compatibility for pipelines using PubsubIO might require
>>>> users to explicitly specify the current default coder (
>>>> PubsubMessageWithAttributesCoder)
>>>> 2. Messages would require a larger number of bytes to store as compared
>>>> to the current default (which could again be overcome by users specifying
>>>> the current default coder)
>>>>
>>>> What other potential drawbacks might there be? I look forward to
>>>> hearing others' input!
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
>>>> [2] https://github.com/apache/beam/pull/22216
>>>> [3] https://github.com/apache/beam/issues/23525
>>>>
>>>


Re: [Proposal] Change to Default PubsubMessage Coder

2022-12-19 Thread Ahmed Abualsaud via dev
+1 to looking into using RowCoder, this may help avoid creating more
specialized coders in the future (which is mentioned as a pain point in the
issue you linked [1]).

[1] https://github.com/apache/beam/issues/23525#issuecomment-1281294275

On Tue, Dec 20, 2022 at 3:00 AM Andrew Pilloud via dev 
wrote:

> I think the Dataflow update concern is the biggest concern and as you
> point out that can be easily overcome. I generally believe that users who
> aren't setting the coder don't actually care as long as it works, so
> changing the default across Beam versions seems relatively low
> risk. Another mitigating factor is that both concerns require users to
> actually be using the coder (likely via Reshuffle.viaRandomKey) rather than
> consuming the output in a fused ParDo (which I think is what we would
> recommend).
>
> As a counter proposal: is there something that stops us from using
> RowCoder by default here? I assume all forms of "PubsubMessage" can be
> encoded with RowCoder, it provides flexibility for future changes, and
> PubSub will be able to take advantage of future work to make RowCoder more
> efficient. (If we can't switch to RowCoder, that seems like a useful
> feature request for RowCoder.)
>
> Andrew
>
> On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin  wrote:
>
>> Bump 
>>
>> Any other risks or drawbacks associated with altering the default coder
>> for PubsubMessage to be the most inclusive coder with respect to possible
>> fields?
>>
>> Thanks,
>> Evan
>>
>>
>> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin  wrote:
>>
>>> Hi folks,
>>>
>>> I'd like to solicit feedback on the notion of using
>>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
>>> default coder for Pubsub messages instead of the current default of
>>> PubsubMessageWithAttributesCoder.
>>>
>>> Not long ago, support for reading and writing Pubsub messages in Beam
>>> including an OrderingKey was added[2].  Part of this change involved adding
>>> a new Coder for PubsubMessage in order to capture and propagate the
>>> orderingKey[1].  This change illuminated that in cases where the coder type
>>> for PubsubMessage is inferred, it is possible to accidentally and silently
>>> nullify fields like MessageId and OrderingKey in a way that is not at all
>>> obvious to users[3].
>>>
>>> So far two potential drawbacks of this proposal have been identified:
>>> 1. Update compatibility for pipelines using PubsubIO might require users
>>> to explicitly specify the current default coder (
>>> PubsubMessageWithAttributesCoder)
>>> 2. Messages would require a larger number of bytes to store as compared
>>> to the current default (which could again be overcome by users specifying
>>> the current default coder)
>>>
>>> What other potential drawbacks might there be? I look forward to hearing
>>> others' input!
>>>
>>> Thanks,
>>> Evan
>>>
>>> [1]
>>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
>>> [2] https://github.com/apache/beam/pull/22216
>>> [3] https://github.com/apache/beam/issues/23525
>>>
>>


Re: [Proposal] Change to Default PubsubMessage Coder

2022-12-19 Thread Andrew Pilloud via dev
I think the Dataflow update concern is the biggest one and, as you point
out, it can be easily overcome. I generally believe that users who aren't
setting the coder don't actually care as long as it works, so changing the
default across Beam versions seems relatively low risk. Another mitigating
factor is that both concerns require users to actually be using the coder
(likely via Reshuffle.viaRandomKey) rather than consuming the output in a
fused ParDo (which I think is what we would recommend).
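
The difference between a fused consumer and a materialization point can be
illustrated with a toy model. The sketch below is plain Python, not Beam:
Msg, through_shuffle and deliver are hypothetical names, and it assumes (as
described above) that a coder only touches elements when they are
materialized between steps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Msg:
    """Toy message (hypothetical type, not Beam's PubsubMessage)."""
    payload: bytes
    message_id: Optional[str] = None


def through_shuffle(msg: Msg) -> Msg:
    """Simulate a materialization point whose coder writes only the payload."""
    encoded = msg.payload          # encode: message_id is never written
    return Msg(payload=encoded)    # decode: message_id falls back to None


def deliver(msg: Msg, fused: bool) -> Msg:
    """Hand an element to the next step, either fused or across a shuffle."""
    return msg if fused else through_shuffle(msg)


original = Msg(payload=b"hello", message_id="1234")
print(deliver(original, fused=True).message_id)   # 1234
print(deliver(original, fused=False).message_id)  # None
```

In the fused case the in-memory object is passed along untouched, so the
choice of coder has no visible effect.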

As a counter proposal: is there something that stops us from using RowCoder
by default here? I assume all forms of "PubsubMessage" can be encoded with
RowCoder, it provides flexibility for future changes, and PubSub will be
able to take advantage of future work to make RowCoder more efficient. (If
we can't switch to RowCoder, that seems like a useful feature request for
RowCoder.)
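
The flexibility argument can be sketched with a toy schema-driven encoder.
This is not Beam's RowCoder or its wire format; SCHEMA_V1, SCHEMA_V2,
encode_row and decode_row are hypothetical. It only illustrates that when the
field list is data rather than a class, adding a field is a schema change
instead of a new specialized coder.

```python
import struct
from typing import List, Optional

# Field names in encoding order; every value is an optional UTF-8 string here.
SCHEMA_V1: List[str] = ["data", "message_id"]
SCHEMA_V2: List[str] = SCHEMA_V1 + ["ordering_key"]


def encode_row(schema: List[str], row: dict) -> bytes:
    """Encode the fields named by the schema, in schema order."""
    out = bytearray()
    for name in schema:
        value: Optional[str] = row.get(name)
        if value is None:
            out += b"\x00"  # null marker, no payload follows
        else:
            raw = value.encode("utf-8")
            # presence marker, 4-byte big-endian length, then the bytes
            out += b"\x01" + struct.pack(">I", len(raw)) + raw
    return bytes(out)


def decode_row(schema: List[str], blob: bytes) -> dict:
    """Decode bytes produced by encode_row with the same schema."""
    row, pos = {}, 0
    for name in schema:
        present = blob[pos]
        pos += 1
        if present == 0:
            row[name] = None
            continue
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        row[name] = blob[pos:pos + length].decode("utf-8")
        pos += length
    return row
```

The same two functions serve both schemas; supporting ordering_key only
required extending the field list.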

Andrew

On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin  wrote:

> Bump 
>
> Any other risks or drawbacks associated with altering the default coder
> for PubsubMessage to be the most inclusive coder with respect to possible
> fields?
>
> Thanks,
> Evan
>
>
> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin  wrote:
>
>> Hi folks,
>>
>> I'd like to solicit feedback on the notion of using
>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
>> default coder for Pubsub messages instead of the current default of
>> PubsubMessageWithAttributesCoder.
>>
>> Not long ago, support for reading and writing Pubsub messages in Beam
>> including an OrderingKey was added[2].  Part of this change involved adding
>> a new Coder for PubsubMessage in order to capture and propagate the
>> orderingKey[1].  This change illuminated that in cases where the coder type
>> for PubsubMessage is inferred, it is possible to accidentally and silently
>> nullify fields like MessageId and OrderingKey in a way that is not at all
>> obvious to users[3].
>>
>> So far two potential drawbacks of this proposal have been identified:
>> 1. Update compatibility for pipelines using PubsubIO might require users
>> to explicitly specify the current default coder (
>> PubsubMessageWithAttributesCoder)
>> 2. Messages would require a larger number of bytes to store as compared
>> to the current default (which could again be overcome by users specifying
>> the current default coder)
>>
>> What other potential drawbacks might there be? I look forward to hearing
>> others' input!
>>
>> Thanks,
>> Evan
>>
>> [1]
>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
>> [2] https://github.com/apache/beam/pull/22216
>> [3] https://github.com/apache/beam/issues/23525
>>
>


Re: [Proposal] Change to Default PubsubMessage Coder

2022-12-19 Thread Evan Galpin
Bump 

Any other risks or drawbacks associated with altering the default coder for
PubsubMessage to be the most inclusive coder with respect to possible
fields?

Thanks,
Evan


On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin  wrote:

> Hi folks,
>
> I'd like to solicit feedback on the notion of using
> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
> default coder for Pubsub messages instead of the current default of
> PubsubMessageWithAttributesCoder.
>
> Not long ago, support for reading and writing Pubsub messages in Beam
> including an OrderingKey was added[2].  Part of this change involved adding
> a new Coder for PubsubMessage in order to capture and propagate the
> orderingKey[1].  This change illuminated that in cases where the coder type
> for PubsubMessage is inferred, it is possible to accidentally and silently
> nullify fields like MessageId and OrderingKey in a way that is not at all
> obvious to users[3].
>
> So far two potential drawbacks of this proposal have been identified:
> 1. Update compatibility for pipelines using PubsubIO might require users
> to explicitly specify the current default coder (
> PubsubMessageWithAttributesCoder)
> 2. Messages would require a larger number of bytes to store as compared to
> the current default (which could again be overcome by users specifying the
> current default coder)
>
> What other potential drawbacks might there be? I look forward to hearing
> others' input!
>
> Thanks,
> Evan
>
> [1]
> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
> [2] https://github.com/apache/beam/pull/22216
> [3] https://github.com/apache/beam/issues/23525
>
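
Both the motivating problem and drawback 2 in the quoted proposal can be seen
in a toy model. The sketch below is plain Python rather than the Java SDK,
and the JSON encoding is invented for illustration; NARROW, WIDE, encode and
decode are hypothetical and do not reflect the real coders' wire formats.

```python
import json
from typing import Dict, Tuple

# Field sets loosely mirroring the two coders under discussion (assumed).
NARROW: Tuple[str, ...] = ("payload", "attributes")
WIDE: Tuple[str, ...] = NARROW + ("message_id", "ordering_key")


def encode(fields: Tuple[str, ...], msg: Dict[str, object]) -> bytes:
    """Serialize only the listed fields; anything else is dropped."""
    kept = {name: msg[name] for name in fields}
    return json.dumps(kept, sort_keys=True).encode("utf-8")


def decode(blob: bytes) -> Dict[str, object]:
    """Rebuild a message; fields absent from the bytes come back as None."""
    stored = json.loads(blob.decode("utf-8"))
    return {name: stored.get(name) for name in WIDE}


msg = {"payload": "hello", "attributes": {"source": "test"},
       "message_id": "1234", "ordering_key": "user-7"}

narrow_bytes = encode(NARROW, msg)
wide_bytes = encode(WIDE, msg)

print(decode(narrow_bytes)["ordering_key"])  # None: dropped with no error
print(decode(wide_bytes) == msg)             # True
print(len(wide_bytes) > len(narrow_bytes))   # True: the cost of keeping them
```

The narrow round trip raises nothing, which is what makes the loss easy to
miss; the wide one preserves every field at the price of more bytes per
element.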


Re: [PROPOSAL] Preparing for Apache Beam 2.44.0 Release

2022-12-19 Thread Ahmet Altay via dev
Take care Kenn, hope you will feel better soon.

How about you continue after the new year? Hopefully you will be feeling
better. Handing off would be hard, and I imagine not a lot of people
would be around to validate in the next 2 weeks anyway.

On Mon, Dec 19, 2022 at 10:17 AM Kenneth Knowles  wrote:

> I managed to acquire covid for my last work week before vacation, so I
> don't expect to make a lot of progress. I'm not sure the best way to hand
> off release processes.
>
> Kenn
>
> On Fri, Dec 16, 2022 at 5:30 PM Ahmet Altay via dev 
> wrote:
>
>> Hello! How is the RC coming along? Do you need help?
>>
>> On Wed, Dec 14, 2022 at 2:33 PM Kenneth Knowles  wrote:
>>
>>> I've edited the subject for this update. There are no more open bugs
>>> targeting the release milestone. I will prepare RC1 shortly.
>>>
>>> Kenn
>>>
>>> On Thu, Dec 1, 2022 at 12:55 PM Kenneth Knowles  wrote:
>>>
>>>> Just an update that the branch is cut.
>>>>
>>>> There are 8 issues targeted to the release milestone:
>>>> https://github.com/apache/beam/milestone/7 (thanks Cham for the
>>>> correct link!)
>>>>
>>>> Please help to close these out or triage them off the milestone. I will
>>>> be looking at them now.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Nov 17, 2022 at 2:27 PM Chamikara Jayalath via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>>
>>>>> Thanks Kenn.
>>>>> BTW the correct milestone for the 2.44.0 release should be this one:
>>>>> https://github.com/apache/beam/milestone/7
>>>>>
>>>>> - Cham
>>>>>
>>>>>
>>>>> On Thu, Nov 17, 2022 at 9:12 AM Ahmet Altay via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> Thank you Kenn! :)
>>>>>>
>>>>>> On Wed, Nov 16, 2022 at 12:45 PM Kenneth Knowles 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> The 2.44.0 release cut is scheduled for Nov 30th [1]. I'd like to
>>>>>>> volunteer to do this release.
>>>>>>>
>>>>>>> As usual, my plan would be to cut right on that date and cherry
>>>>>>> pick critical fixes.
>>>>>>>
>>>>>>> Help me and the release by:
>>>>>>> - Making sure that any unresolved release blocking issues for 2.44.0
>>>>>>> have their "Milestone" marked as "2.44.0 Release" [2].
>>>>>>> - Reviewing the current release blockers [2] and removing the
>>>>>>> Milestone if they don't meet the criteria at [3].
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> [1]
>>>>>>> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com
>>>>>>> [2] https://github.com/apache/beam/milestone/5
>>>>>>> [3] https://beam.apache.org/contribute/release-blocking/
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>


Re: [PROPOSAL] Preparing for Apache Beam 2.44.0 Release

2022-12-19 Thread Kenneth Knowles
I managed to acquire covid for my last work week before vacation, so I
don't expect to make a lot of progress. I'm not sure the best way to hand
off release processes.

Kenn

On Fri, Dec 16, 2022 at 5:30 PM Ahmet Altay via dev 
wrote:

> Hello! How is the RC coming along? Do you need help?
>
> On Wed, Dec 14, 2022 at 2:33 PM Kenneth Knowles  wrote:
>
>> I've edited the subject for this update. There are no more open bugs
>> targeting the release milestone. I will prepare RC1 shortly.
>>
>> Kenn
>>
>> On Thu, Dec 1, 2022 at 12:55 PM Kenneth Knowles  wrote:
>>
>>> Just an update that the branch is cut.
>>>
>>> There are 8 issues targeted to the release milestone:
>>> https://github.com/apache/beam/milestone/7 (thanks Cham for the correct
>>> link!)
>>>
>>> Please help to close these out or triage them off the milestone. I will
>>> be looking at them now.
>>>
>>> Kenn
>>>
>>> On Thu, Nov 17, 2022 at 2:27 PM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>>
>>>> Thanks Kenn.
>>>> BTW the correct milestone for the 2.44.0 release should be this one:
>>>> https://github.com/apache/beam/milestone/7
>>>>
>>>> - Cham
>>>>
>>>>
>>>> On Thu, Nov 17, 2022 at 9:12 AM Ahmet Altay via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Thank you Kenn! :)
>>>>>
>>>>> On Wed, Nov 16, 2022 at 12:45 PM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> The 2.44.0 release cut is scheduled for Nov 30th [1]. I'd like to
>>>>>> volunteer to do this release.
>>>>>>
>>>>>> As usual, my plan would be to cut right on that date and cherry
>>>>>> pick critical fixes.
>>>>>>
>>>>>> Help me and the release by:
>>>>>> - Making sure that any unresolved release blocking issues for 2.44.0
>>>>>> have their "Milestone" marked as "2.44.0 Release" [2].
>>>>>> - Reviewing the current release blockers [2] and removing the
>>>>>> Milestone if they don't meet the criteria at [3].
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> [1]
>>>>>> https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com
>>>>>> [2] https://github.com/apache/beam/milestone/5
>>>>>> [3] https://beam.apache.org/contribute/release-blocking/
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>


Beam High Priority Issue Report (38)

2022-12-19 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/24675 [Bug]: POM of beam-sdks-java-core 
2.45.0-SNAPSHOT is broken
https://github.com/apache/beam/issues/24655 [Bug]: Pipeline fusion should break 
at @RequiresStableInput boundary
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24367 [Bug]: workflow.tar.gz cannot be 
passed to flink runner
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/24267 [Failing Test]: Timeout waiting to 
lock gradle
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23286 [Bug]: 
beam_PerformanceTests_InfluxDbIO_IT Flaky > 50 % Fail 
https://github.com/apache/beam/issues/22969 Discrepancy in behavior of 
`DoFn.process()` when `yield` is combined with `return` statement, or vice versa
https://github.com/apache/beam/issues/22961 [Bug]: WriteToBigQuery silently 
skips most of records without job fail
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not 
raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21104 Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with 
grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20689 Kafka commitOffsetsInFinalize OOM 
on Flink
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19241 Python Dataflow integration tests 
should export the pipeline Job ID and console output to Jenkins Test Result 
section


P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/24464 [Epic]: Implement 
FileWriteSchemaTransformProvider
https://github.com/apache/beam/issues/23875 [Bug]: beam.Row.__eq__ returns true 
for unequal rows
https://github.com/apache/beam/issues/23306 [Bug]: BigQueryBatchFileLoads in 
python loses data when using WRITE_TRUNCATE
https://github.com/apache/beam/issues/22115 [Bug]: 
apache_beam.runners.portability.portable_runner_test.PortableRunnerTestWithSubprocesses
 is flaky
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, 
testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21700 
--dataflowServiceOptions=use_runner_v2 is broken
https://github.com/apache/beam/issues/21645 
beam_PostCommit_XVR_GoUsingJava_Dataflow fails on some test transforms
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21424