Re: [Request for Feedback] Swift SDK Prototype

2023-09-14 Thread Byron Ellis via dev
Hi all,

I've been on vacation, but mostly working on getting External Transform
support going (which in turn basically requires Schema support as well). It
also looks like macros landed in Swift 5.9 for Linux so we'll be able to
use those to do some compile-time automation. In particular, this lets us
do something similar to what Java does with ByteBuddy for generating schema
coders, though it has to happen ahead of time rather than at runtime, so
it's not quite the same. (As far as I can tell this is one of the reasons
macros were added to the language in the first place---Apple's SwiftData
library makes heavy use of the feature.)
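To make the idea concrete, here is a hand-written sketch of roughly what such a macro might expand to at compile time (the role ByteBuddy plays at runtime in Java). All of the names below are invented for illustration; none of them come from the actual Beam Swift SDK:

```swift
// Hypothetical sketch only: a macro such as `@BeamSchema` (assumed name)
// could synthesize an extension like this during compilation, instead of
// the user writing it by hand or the SDK generating it at runtime.
struct Order {
    let id: Int64
    let item: String
}

extension Order {
    // A simple field-name/field-type description of the schema.
    static var schemaFields: [(name: String, type: String)] {
        [("id", "INT64"), ("item", "STRING")]
    }
    // Flatten the struct into a row for a schema coder to encode.
    func encodeRow() -> [Any] { [id, item] }
}
```

The difference from ByteBuddy is that the expansion happens once at compile time, so there is no runtime bytecode generation step.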

I do have one question for the group though: should the Swift SDK
distribution take on Beam community properties or Swift community
properties? Specifically, in the Swift world the Swift SDK would live in
its own repo (beam-swift for example), which allows it to be most easily
consumed and keeps the checkout size under control for users. "Releases" in
the Swift world (much like Go) are just repo tags. The downside here is
that there's overhead in setting up the various github actions and other
CI/CD bits and bobs.
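For a sense of what the standalone-repo option buys users: consuming a tagged release would be a one-line dependency in their manifest. A minimal sketch of a consumer's Package.swift follows (the repo URL, product name, and version are assumptions for illustration only):

```swift
// swift-tools-version:5.9
// Hypothetical consumer manifest, assuming a standalone "beam-swift" repo
// whose releases are just tags (URL, product, and version are assumed).
import PackageDescription

let package = Package(
    name: "MyPipeline",
    dependencies: [
        // A release is just a tag on the SDK repo, Go-style.
        .package(url: "https://github.com/apache/beam-swift.git", from: "0.1.0")
    ],
    targets: [
        .executableTarget(
            name: "MyPipeline",
            dependencies: [.product(name: "ApacheBeam", package: "beam-swift")]
        )
    ]
)
```

SwiftPM then fetches only the SDK repo at that tag, not the whole Beam monorepo.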

The alternative would be to keep it in the beam repo itself like it is now,
but we'd probably want to move Package.swift to the root since, for whatever
reason, the Swift community (much to some people's annoyance) has chosen to
require packages to live at the top of a repo. This has less overhead from a
CI/CD perspective, but lots of overhead for users, as they'd be checking out
the entire Beam repo to use the SDK, and checkouts happen a lot.

There's a third option which is basically "do both" but honestly that just
seems like the worst of both worlds as it would require constant syncing if
we wanted to make it possible for Swift users to target unreleased SDKs for
development and testing.

Personally, I would lean towards the former option (and would volunteer to
set up & document the various automations) as it is lighter for the actual
users of the SDK and more consistent with the community experience they
expect. The CI/CD stuff is mostly a "do it once" whereas checking out the
entire repo with many updates the user doesn't care about is something they
will be doing all the time. FWIW some of our dependencies also chose this
route---most notably gRPC, which started with the latter approach and has
moved to the former.

Interested to hear any feedback on the subject since I'm guessing it
probably came up with the Go SDK back in the day?

Best,
B



On Tue, Aug 29, 2023 at 7:59 AM Byron Ellis  wrote:

> After a couple of iterations (thanks rebo!) we've also gotten the Swift
> SDK working with the new Prism runner. The fact that it doesn't do fusion
> caught a couple of configuration bugs (e.g. that the gRPC message receiver
> buffer should be fairly large). It would seem that at the moment Prism and
> the Flink runner have similar orders of strictness when interpreting the
> pipeline graph while the Python portable runner is far more forgiving.
>
> Also added support for bounded vs unbounded pcollections through the
> "type" parameter when adding a pardo. Impulse is a bounded pcollection, I
> believe?
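To make the "type" parameter idea concrete, here is a minimal sketch; the enum and the PCollection/impulse shapes are invented for illustration and are not the SDK's actual API:

```swift
// Illustrative sketch only: these names are assumptions, not the actual
// Beam Swift SDK API.
enum CollectionType {
    case bounded
    case unbounded
}

struct PCollection<T> {
    // The "type" parameter mentioned above, carried on the collection.
    let type: CollectionType
}

// Impulse emits a single element and is then done, so treating its output
// as bounded (as suggested above) seems natural.
func impulse() -> PCollection<[UInt8]> {
    PCollection(type: .bounded)
}
```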
>
> On Fri, Aug 25, 2023 at 2:04 PM Byron Ellis  wrote:
>
>> Okay, after a brief detour through "get this working in the Flink
>> Portable Runner" I think I have something pretty workable.
>>
>> PInput and POutput can actually be structs rather than protocols, which
>> simplifies things quite a bit. It also allows us to use them with property
>> wrappers for a SwiftUI-like experience if we want when defining DoFns
>> (which is what I was originally intending to use them for). That also means
>> the function signature you use for closures would match full-fledged DoFn
>> definitions for the most part which is satisfying.
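For what a property-wrapper-based DoFn might feel like, here is a tiny self-contained sketch; the wrapper name, the DoFn shape, and the method names are all invented for this example and the real SDK may differ:

```swift
// Hypothetical SwiftUI-style DoFn sketch; names are assumptions.
@propertyWrapper
struct Input<T> {
    var wrappedValue: T
    init(wrappedValue: T) { self.wrappedValue = wrappedValue }
}

struct EmitUppercase {
    // The current element is injected via a property wrapper rather than
    // passed as a method argument.
    @Input var element: String
    var output: [String] = []

    mutating func process() {
        output.append(element.uppercased())
    }
}

var fn = EmitUppercase(element: "beam")
fn.process()
// fn.output now contains the uppercased element.
```

The appeal is that a closure-based pardo and a full struct-based DoFn can end up with essentially the same processing signature.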
>>
>>
>>
>> On Thu, Aug 24, 2023 at 5:55 PM Byron Ellis 
>> wrote:
>>
>>> Okay, I tried a couple of different things.
>>>
>>> Implicitly passing the timestamp and window during iteration did not go
>>> well. While physically possible, it introduces an invisible side effect
>>> into loop iteration that confused me when I tried to use it, even though
>>> I was the one who implemented it. Also, I'm pretty sure continuing down
>>> that path would end in some sort of race condition nightmare.
>>>
>>> What I decided to do instead was the following:
>>>
>>> 1. Rename the existing "pardo" functions to "pstream" and require that
>>> they always emit a window and timestamp along with their value. This
>>> eliminates the side effect but lets us keep iteration in a bundle where
>>> that might be convenient. For example, in my cheesy GCS implementation it
>>> means that I can keep an OAuth token around for the lifetime of the bundle
>>> as a local variable, which is convenient. It's a bit more typing for users
>>> of pstream, but the expectation here is that if you're using pstream
>>> functions You Know What You Are Doing and most people won't be using it
>>> directly.
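A rough sketch of the pstream shape described above, where the user function sees the whole bundle and must emit value, timestamp, and window explicitly; every name here is an illustrative assumption, not the SDK's actual API:

```swift
import Foundation

// Hypothetical pstream-style types; names are assumptions.
struct Window: Equatable { let maxTimestamp: Date }

struct Emitter<T> {
    var emitted: [(T, Date, Window)] = []
    // No implicit context: timestamp and window are passed explicitly.
    mutating func emit(_ value: T, timestamp: Date, window: Window) {
        emitted.append((value, timestamp, window))
    }
}

// Because the function iterates the whole bundle, per-bundle state (like
// an OAuth token for a GCS client) can live in a local variable.
func pstreamUppercase(bundle: [(String, Date, Window)],
                      emitter: inout Emitter<String>) {
    let token = "per-bundle-state" // fetched once, reused for the bundle
    _ = token
    for (value, ts, window) in bundle {
        emitter.emit(value.uppercased(), timestamp: ts, window: window)
    }
}
```

The explicit (value, timestamp, window) triple is the extra typing mentioned above, and it is also what removes the hidden side effect.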
>>>
>>> 2. Introduce a new set of pardo function

Beam High Priority Issue Report (42)

2023-09-14 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/28383 [Failing Test]: 
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing 
"beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: 
apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28168 [Bug]: BigQuery Storage Write API 
does not write with no complaint
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be 
leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. 
PeriodicImpulse) running in Flink and polling using tracker.defer_remainder 
have checkpoint size growing indefinitely 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to 
SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26969 [Failing Test]: Python PostCommit 
is failing due to exceeded rate limits
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26354 [Bug]: BigQueryIO direct read not 
reading all rows when set --setEnableBundling=true
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/25975 [Bug]: Reducing parallelism in 
FlinkRunner leads to a data loss
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, 
testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_te

Re: [PROPOSAL] Preparing for 2.51.0 Release

2023-09-14 Thread Jean-Baptiste Onofré
Awesome! Thanks Kenn!

Regards
JB

On Thu, Sep 14, 2023 at 3:20 AM Kenneth Knowles  wrote:
>
> Hello Beam community!
>
> The next release (2.51.0) branch cut is scheduled for September 20, 2023, one 
> week from today, according to the release calendar [1].
>
> I'd like to volunteer to perform this release. My plan is to cut the branch 
> on that date, and cherrypick release-blocking fixes afterwards, if any.
>
> Please help me make sure the release goes smoothly by:
>
> - Making sure that any unresolved release-blocking issues for 2.51.0 have 
> their "Milestone" set to "2.51.0 Release" as soon as possible.
>
> - Reviewing the current release blockers [2] and removing the Milestone if 
> they don't meet the criteria at [3]. There are currently 12 release blockers.
>
> Let me know if you have any comments/objections/questions.
>
> Thanks,
>
> Kenn
>
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2] https://github.com/apache/beam/milestone/15
> [3] https://beam.apache.org/contribute/release-blocking/