Re: [Proposal] Change to Default PubsubMessage Coder
Technically, since the PubSubMessage is just a protocol buffer, I'm a bit surprised we aren't using its native encoding under the hood. The Go SDK accepts either byte slices (which it uses to populate a message's data field) or the proto representation, which it publishes straight out. Similarly for reads: either the data field is extracted as bytes, or the whole PubSubMessage is returned. Granted, part of this is because the coder setup in the Go SDK isn't as malleable as in Java or Python, and the types aren't recodable between transforms.

Beam Go Busybody
Robert Burke
Re: [Proposal] Change to Default PubsubMessage Coder
+1 to looking into using RowCoder, this may help avoid creating more specialized coders in the future (which is mentioned as a pain point in the issue you linked [1]).

[1] https://github.com/apache/beam/issues/23525#issuecomment-1281294275
Re: [Proposal] Change to Default PubsubMessage Coder
I think the Dataflow update concern is the biggest concern and as you point out that can be easily overcome. I generally believe that users who aren't setting the coder don't actually care as long as it works, so changing the default across Beam versions seems relatively low risk. Another mitigating factor is that both concerns require users to actually be using the coder (likely via Reshuffle.viaRandomKey) rather than consuming the output in a fused ParDo (which I think is what we would recommend).

As a counter proposal: is there something that stops us from using RowCoder by default here? I assume all forms of "PubsubMessage" can be encoded with RowCoder, it provides flexibility for future changes, and PubSub will be able to take advantage of future work to make RowCoder more efficient. (If we can't switch to RowCoder, that seems like a useful feature request for RowCoder.)

Andrew
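The update-compatibility concern discussed here has a straightforward mitigation: pipelines that need to stay update-compatible can pin the current default coder explicitly rather than relying on inference. A minimal Java sketch (assuming the Beam Java SDK with the PubsubIO module on the classpath; `pipeline` and the subscription path are placeholders):

```java
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.values.PCollection;

PCollection<PubsubMessage> messages =
    pipeline.apply(PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/my-project/subscriptions/my-sub"));

// Pin today's default coder explicitly, so that a later change of the
// SDK-wide default cannot affect this pipeline's encoded form on update.
messages.setCoder(PubsubMessageWithAttributesCoder.of());
```

With the coder fixed in the pipeline itself, a future change to the inferred default would not alter what this pipeline encodes.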
Re: [Proposal] Change to Default PubsubMessage Coder
Bump

Any other risks or drawbacks associated with altering the default coder for PubsubMessage to be the most inclusive coder with respect to possible fields?

Thanks,
Evan

On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin wrote:

> Hi folks,
>
> I'd like to solicit feedback on the notion of using PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder [1] as the default coder for Pubsub messages instead of the current default of PubsubMessageWithAttributesCoder.
>
> Not long ago, support for reading and writing Pubsub messages in Beam including an OrderingKey was added [2]. Part of this change involved adding a new Coder for PubsubMessage in order to capture and propagate the orderingKey [1]. This change illuminated that in cases where the coder type for PubsubMessage is inferred, it is possible to accidentally and silently nullify fields like MessageId and OrderingKey in a way that is not at all obvious to users [3].
>
> So far two potential drawbacks of this proposal have been identified:
> 1. Update compatibility for pipelines using PubsubIO might require users to explicitly specify the current default coder (PubsubMessageWithAttributesCoder)
> 2. Messages would require a larger number of bytes to store as compared to the current default (which could again be overcome by users specifying the current default coder)
>
> What other potential drawbacks might there be? I look forward to hearing others' input!
>
> Thanks,
> Evan
>
> [1] https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
> [2] https://github.com/apache/beam/pull/22216
> [3] https://github.com/apache/beam/issues/23525
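The silent field loss described in the proposal can be illustrated without Beam at all. The sketch below is a toy analog, not Beam code: `Msg` is a hypothetical stand-in for PubsubMessage, and the two encode/decode pairs mimic a narrow coder (which omits fields) versus an inclusive one. Any materialization through the narrow pair nulls out messageId and orderingKey with no error raised, which is the failure mode the proposal aims to prevent.

```java
import java.nio.charset.StandardCharsets;

public class CoderFieldLossSketch {

    // Hypothetical stand-in for PubsubMessage; NOT the Beam class.
    public static final class Msg {
        public final byte[] payload;
        public final String messageId;   // may be null
        public final String orderingKey; // may be null

        public Msg(byte[] payload, String messageId, String orderingKey) {
            this.payload = payload;
            this.messageId = messageId;
            this.orderingKey = orderingKey;
        }
    }

    // Analog of a narrow coder: only the payload is written, so the other
    // fields silently come back null after any encode/decode round trip.
    public static byte[] encodePayloadOnly(Msg m) {
        return m.payload.clone();
    }

    public static Msg decodePayloadOnly(byte[] bytes) {
        return new Msg(bytes.clone(), null, null); // fields silently dropped
    }

    // Analog of an inclusive coder: every field is written and read back.
    // (Naive delimiter framing, for illustration only.)
    public static String encodeFull(Msg m) {
        return new String(m.payload, StandardCharsets.UTF_8) + "|"
            + (m.messageId == null ? "" : m.messageId) + "|"
            + (m.orderingKey == null ? "" : m.orderingKey);
    }

    public static Msg decodeFull(String encoded) {
        String[] parts = encoded.split("\\|", -1);
        return new Msg(parts[0].getBytes(StandardCharsets.UTF_8),
                       parts[1].isEmpty() ? null : parts[1],
                       parts[2].isEmpty() ? null : parts[2]);
    }

    public static void main(String[] args) {
        Msg original = new Msg("hello".getBytes(StandardCharsets.UTF_8), "id-1", "key-1");

        // Round trip through the narrow pair: orderingKey vanishes silently.
        Msg narrow = decodePayloadOnly(encodePayloadOnly(original));
        System.out.println("narrow orderingKey: " + narrow.orderingKey); // null

        // Round trip through the inclusive pair: all fields survive.
        Msg full = decodeFull(encodeFull(original));
        System.out.println("full orderingKey: " + full.orderingKey); // key-1
    }
}
```

The narrow round trip loses data without any exception, matching the "accidentally and silently nullify fields" behavior reported in [3].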
Re: [PROPOSAL] Preparing for Apache Beam 2.44.0 Release
Take care Kenn, hope you will feel better soon.

How about you continue after the new year? Hopefully you will be feeling better. Handing off would be hard, and I imagine not a lot of people would be around to validate in the next 2 weeks anyway.
Re: [PROPOSAL] Preparing for Apache Beam 2.44.0 Release
I managed to acquire covid for my last work week before vacation, so I don't expect to make a lot of progress. I'm not sure the best way to hand off release processes.

Kenn

On Fri, Dec 16, 2022 at 5:30 PM Ahmet Altay via dev wrote:

> Hello! How is the RC coming along? Do you need help?

On Wed, Dec 14, 2022 at 2:33 PM Kenneth Knowles wrote:

> I've edited the subject for this update. There are no more open bugs targeting the release milestone. I will prepare RC1 shortly.

On Thu, Dec 1, 2022 at 12:55 PM Kenneth Knowles wrote:

> Just an update that the branch is cut.
>
> There are 8 issues targeted to the release milestone: https://github.com/apache/beam/milestone/7 (thanks Cham for the correct link!)
>
> Please help to close these out or triage them off the milestone. I will be looking at them now.

On Thu, Nov 17, 2022 at 2:27 PM Chamikara Jayalath via dev wrote:

> Thanks Kenn.
> BTW the correct milestone for the 2.44.0 release should be this one: https://github.com/apache/beam/milestone/7
>
> - Cham

On Thu, Nov 17, 2022 at 9:12 AM Ahmet Altay via dev wrote:

> Thank you Kenn! :)

On Wed, Nov 16, 2022 at 12:45 PM Kenneth Knowles wrote:

> Hi all,
>
> The 2.44.0 release cut is scheduled for Nov 30th [1]. I'd like to volunteer to do this release.
>
> As usual, my plan would be to cut right on that date and cherry pick critical fixes.
>
> Help me and the release by:
> - Making sure that any unresolved release blocking issues for 2.44.0 have their "Milestone" marked as "2.44.0 Release" [2].
> - Reviewing the current release blockers [2] and removing the Milestone if they don't meet the criteria at [3].
>
> Kenn
>
> [1] https://calendar.google.com/calendar/u/0/embed?src=0p73sl034k80oob7seouani...@group.calendar.google.com
> [2] https://github.com/apache/beam/milestone/5
> [3] https://beam.apache.org/contribute/release-blocking/
Beam High Priority Issue Report (38)
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/24675 [Bug]: POM of beam-sdks-java-core 2.45.0-SNAPSHOT is broken
https://github.com/apache/beam/issues/24655 [Bug]: Pipeline fusion should break at @RequiresStableInput boundary
https://github.com/apache/beam/issues/24389 [Failing Test]: HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError ContainerFetchException
https://github.com/apache/beam/issues/24367 [Bug]: workflow.tar.gz cannot be passed to flink runner
https://github.com/apache/beam/issues/24313 [Flaky]: apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/24267 [Failing Test]: Timeout waiting to lock gradle
https://github.com/apache/beam/issues/23944 beam_PreCommit_Python_Cron regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23286 [Bug]: beam_PerformanceTests_InfluxDbIO_IT Flaky > 50 % Fail
https://github.com/apache/beam/issues/22969 Discrepancy in behavior of `DoFn.process()` when `yield` is combined with `return` statement, or vice versa
https://github.com/apache/beam/issues/22961 [Bug]: WriteToBigQuery silently skips most of records without job fail
https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakes in org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output to Failed Inserts PCollection
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial (order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit data at GC time
https://github.com/apache/beam/issues/21121 apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it flakey
https://github.com/apache/beam/issues/21104 Flaky: apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20689 Kafka commitOffsetsInFinalize OOM on Flink
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19241 Python Dataflow integration tests should export the pipeline Job ID and console output to Jenkins Test Result section

P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/24464 [Epic]: Implement FileWriteSchemaTransformProvider
https://github.com/apache/beam/issues/23875 [Bug]: beam.Row.__eq__ returns true for unequal rows
https://github.com/apache/beam/issues/23306 [Bug]: BigQueryBatchFileLoads in python loses data when using WRITE_TRUNCATE
https://github.com/apache/beam/issues/22115 [Bug]: apache_beam.runners.portability.portable_runner_test.PortableRunnerTestWithSubprocesses is flaky
https://github.com/apache/beam/issues/21714 PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21700 --dataflowServiceOptions=use_runner_v2 is broken
https://github.com/apache/beam/issues/21645 beam_PostCommit_XVR_GoUsingJava_Dataflow fails on some test transforms
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table destinations returns wrong tableId
https://github.com/apache/beam/issues/21424