Re: [Request for Feedback] Swift SDK Prototype
Hi all,

I've been on vacation, but mostly working on getting External Transform support going (which in turn basically requires Schema support as well). It also looks like macros landed in Swift 5.9 for Linux, so we'll be able to use those for some compile-time automation. In particular, this lets us do something similar to what Java does with ByteBuddy for generating schema coders, though it has to happen ahead of time, so it's not quite the same. (As far as I can tell this is a reason why macros got added to the language in the first place: Apple's SwiftData library makes heavy use of the feature.)

I do have one question for the group, though: should the Swift SDK distribution take on Beam community properties or Swift community properties? Specifically, in the Swift world the Swift SDK would live in its own repo (beam-swift, for example), which allows it to be most easily consumed and keeps the checkout size under control for users. "Releases" in the Swift world (much like Go) are just repo tags. The downside here is that there's overhead in setting up the various GitHub Actions and other CI/CD bits and bobs.

The alternative would be to keep it in the Beam repo itself, like it is now, but we'd probably want to move Package.swift to the root, since for whatever reason the Swift community (much to some people's annoyance) has chosen to allow packages to live only at the top of a repo. This has less overhead from a CI/CD perspective, but lots of overhead for users, as they'd be checking out the entire Beam repo to use the SDK, and that happens a lot.

There's a third option, which is basically "do both," but honestly that just seems like the worst of both worlds, as it would require constant syncing if we wanted to make it possible for Swift users to target unreleased SDKs for development and testing.
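To make the first option concrete: consuming a standalone repo is just a package dependency resolved against git tags. The repo URL, product name, and version below are hypothetical placeholders, purely for illustration of the mechanism:

```swift
// swift-tools-version:5.9
// Sketch of a user's Package.swift consuming a hypothetical standalone
// "beam-swift" repo. URL, product name, and version are illustrative only.
import PackageDescription

let package = Package(
    name: "MyPipeline",
    dependencies: [
        // SwiftPM resolves "from: 2.51.0" against git tags on the repo,
        // which is why "releases are just repo tags" in the Swift world.
        .package(url: "https://github.com/apache/beam-swift", from: "2.51.0")
    ],
    targets: [
        .executableTarget(
            name: "MyPipeline",
            dependencies: [.product(name: "ApacheBeam", package: "beam-swift")]
        )
    ]
)
```

Under this model, tagging a commit (e.g. `git tag 2.51.0 && git push --tags`) is essentially the whole release-publication step as far as SwiftPM is concerned; there is no separate artifact upload.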
Personally, I would lean towards the former option (and would volunteer to set up and document the various automations), as it is lighter for the actual users of the SDK and more consistent with the community experience they expect. The CI/CD setup is mostly a "do it once," whereas checking out the entire repo, full of updates the user doesn't care about, is something they would be doing all the time. FWIW, some of our dependencies also chose this route, most notably gRPC, which started with the latter approach and has moved to the former.

Interested to hear any feedback on the subject, since I'm guessing it probably came up with the Go SDK back in the day?

Best,
B

On Tue, Aug 29, 2023 at 7:59 AM Byron Ellis wrote:

> After a couple of iterations (thanks rebo!) we've also gotten the Swift
> SDK working with the new Prism runner. The fact that it doesn't do fusion
> caught a couple of configuration bugs (e.g. that the grpc message receiver
> buffer should be fairly large). It would seem that at the moment Prism and
> the Flink runner are similarly strict when interpreting the
> pipeline graph, while the Python portable runner is far more forgiving.
>
> Also added support for bounded vs. unbounded pcollections through the
> "type" parameter when adding a pardo. Impulse is a bounded pcollection, I
> believe?
>
> On Fri, Aug 25, 2023 at 2:04 PM Byron Ellis wrote:
>
>> Okay, after a brief detour through "get this working in the Flink
>> Portable Runner" I think I have something pretty workable.
>>
>> PInput and POutput can actually be structs rather than protocols, which
>> simplifies things quite a bit. It also allows us to use them with property
>> wrappers for a SwiftUI-like experience if we want when defining DoFns
>> (which is what I was originally intending to use them for). That also means
>> the function signature you use for closures would match full-fledged DoFn
>> definitions for the most part, which is satisfying.
>>
>> On Thu, Aug 24, 2023 at 5:55 PM Byron Ellis wrote:
>>
>>> Okay, I tried a couple of different things.
>>>
>>> Implicitly passing the timestamp and window during iteration did not go
>>> well. While physically possible, it introduces an invisible side effect
>>> into loop iteration, which confused me when I tried to use it, and I'm the
>>> one who implemented it. Also, I'm pretty sure there'd end up being some
>>> sort of race condition nightmare continuing down that path.
>>>
>>> What I decided to do instead was the following:
>>>
>>> 1. Rename the existing "pardo" functions to "pstream" and require that
>>> they always emit a window and timestamp along with their value. This
>>> eliminates the side effect but lets us keep iteration in a bundle where
>>> that might be convenient. For example, in my cheesy GCS implementation it
>>> means that I can keep an OAuth token around for the lifetime of the bundle
>>> as a local variable, which is convenient. It's a bit more typing for users
>>> of pstream, but the expectation here is that if you're using pstream
>>> functions You Know What You Are Doing, and most people won't be using them
>>> directly.
>>>
>>> 2. Introduce a new set of pardo function
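The "SwiftUI-like experience" with property wrappers mentioned above can be sketched with a toy example. None of these types come from the actual Swift SDK prototype; they are hypothetical stand-ins that only illustrate the declarative pattern the email describes:

```swift
// Toy sketch of declaring DoFn inputs via a property wrapper, in the
// spirit of SwiftUI's @State. "Input" and "WordLength" are hypothetical
// illustrations, not types from the Swift SDK prototype.
@propertyWrapper
struct Input<T> {
    var wrappedValue: T
}

struct WordLength {
    // A full-fledged DoFn would declare its inputs declaratively...
    @Input var word: String

    // ...and the body then has a shape matching the closure signature
    // users would otherwise write inline.
    func process() -> Int { word.count }
}

let fn = WordLength(word: "beam")
assert(fn.process() == 4)
```

One appeal of this shape is that PInput/POutput being structs (rather than protocols) makes them directly usable as wrapped values, which is what the email says simplified things.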
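The pstream/pardo split in point 1 can also be illustrated with a toy sketch. All names here are hypothetical stand-ins rather than the prototype's real API; the point is only that a pstream-style body emits an explicit (value, timestamp, window) triple per element, while an element-wise pardo body would inherit those implicitly:

```swift
import Foundation

// Hypothetical stand-ins for illustration only; not the SDK's real types.
struct Window: Equatable { let end: Date }
struct Timestamped<T> { let value: T; let timestamp: Date; let window: Window }

// A pstream-style body iterates over the bundle itself (so it can hold
// per-bundle state such as an OAuth token in a local variable) and must
// emit the timestamp and window explicitly with every output...
func pstreamUppercase(
    _ bundle: [Timestamped<String>],
    emit: (String, Date, Window) -> Void
) {
    for element in bundle {
        emit(element.value.uppercased(), element.timestamp, element.window)
    }
}

// ...whereas a pardo-style body is element-wise: the input's timestamp
// and window carry over, and the user only produces the new value.
func pardoUppercase(_ bundle: [Timestamped<String>]) -> [Timestamped<String>] {
    bundle.map {
        Timestamped(value: $0.value.uppercased(),
                    timestamp: $0.timestamp, window: $0.window)
    }
}

let now = Date()
let w = Window(end: now.addingTimeInterval(60))
let input = [Timestamped(value: "hello", timestamp: now, window: w)]

var streamed: [String] = []
pstreamUppercase(input) { value, _, _ in streamed.append(value) }
assert(streamed == ["HELLO"])
assert(pardoUppercase(input).first?.value == "HELLO")
```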
Beam High Priority Issue Report (42)
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/28383 [Failing Test]: org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing "beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28168 [Bug]: BigQuery Storage Write API does not write with no complaint
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not working when using CreateDisposition.CREATE_IF_NEEDED
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. PeriodicImpulse) running in Flink and polling using tracker.defer_remainder have checkpoint size growing indefinitely
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26969 [Failing Test]: Python PostCommit is failing due to exceeded rate limits
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested ROW (described below)
https://github.com/apache/beam/issues/26354 [Bug]: BigQueryIO direct read not reading all rows when set --setEnableBundling=true
https://github.com/apache/beam/issues/26343 [Bug]: apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/25975 [Bug]: Reducing parallelism in FlinkRunner leads to a data loss
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944 beam_PreCommit_Python_Cron regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakes in org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial (order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit data at GC time
https://github.com/apache/beam/issues/21121 apache_beam.examples.streaming_wordcount_it_te
Re: [PROPOSAL] Preparing for 2.51.0 Release
Awesome! Thanks Kenn!

Regards
JB

On Thu, Sep 14, 2023 at 3:20 AM Kenneth Knowles wrote:
>
> Hello Beam community!
>
> The next release (2.51.0) branch cut is scheduled for September 20, 2023,
> one week from today, according to the release calendar [1].
>
> I'd like to volunteer to perform this release. My plan is to cut the branch
> on that date and cherry-pick release-blocking fixes afterwards, if any.
>
> Please help me make sure the release goes smoothly by:
>
> - Making sure that any unresolved release-blocking issues for 2.51.0 have
> their "Milestone" marked as "2.51.0 Release" as soon as possible.
>
> - Reviewing the current release blockers [2] and removing the Milestone if
> they don't meet the criteria at [3]. There are currently 12 release blockers.
>
> Let me know if you have any comments/objections/questions.
>
> Thanks,
>
> Kenn
>
> [1] https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2] https://github.com/apache/beam/milestone/15
> [3] https://beam.apache.org/contribute/release-blocking/