Streaming update compatibility

2023-10-25 Thread Robert Bradshaw via dev
Dataflow (among other runners) has the ability to "upgrade" running
pipelines with new code (e.g. capturing bug fixes, dependency updates,
and limited topology changes). Unfortunately, some improvements (e.g.
new and improved ways of writing to BigQuery, optimized use of side
inputs, or a change in algorithm, sometimes completely internal and not
visible to the user) are not sufficiently backwards compatible. To
avoid breaking users, this forces us either to forgo these changes or
to guard them behind a parallel opt-in mode, which is a significant
drain on developer productivity and causes new pipelines to run in
obsolete modes by default.

I created https://github.com/apache/beam/pull/29140 which adds a new
pipeline option, update_compatibility_version, that allows the SDK to
move forward while letting users with previously launched pipelines
manually request the "old" way of doing things to preserve update
compatibility. (We should still attempt backwards compatibility when
it makes sense, and the old way would remain in the code until it's
actually deprecated and removed, but this means we won't be
constrained by it, especially when it comes to default settings.)
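
For concreteness, a minimal sketch of how a previously launched
pipeline might pin the old behavior (the option name comes from the PR
above; the specific version value, and pairing it with --update, are
illustrative assumptions, and other required Dataflow options such as
--project and --region are omitted):

    # Sketch: a streaming Dataflow job originally launched with SDK
    # 2.50.0, updated in place while requesting 2.50.0-era defaults
    # (illustrative value).
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--streaming',
        '--update',  # update the running job in place
        '--update_compatibility_version=2.50.0',
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
         | beam.Map(lambda msg: msg.decode('utf-8')))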

Any objections or other thoughts on this approach?

- Robert

P.S. Separately, I think it'd be valuable to elevate the vague notion
of update compatibility to a first-class Beam concept and put it on
firm footing, but that's a larger conversation beyond the scope of
this smaller (and, I think, still useful in such a future world) change.


Re: [Discuss] Idea to increase RC voting participation

2023-10-25 Thread Danny McCormick via dev
> One easy and standard way to make it more resilient would be to make it
idempotent instead of counting on uptime or receiving any particular event.

Yep, agreed that this wouldn't be super hard if someone wants to take it
on. Basically it would just be updating the tool to run on a schedule and
look for issues that have been closed as completed in the last N days (more
or less this query -
https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01+).
I have seen some milestones intentionally removed from issues after the bot
adds them (probably because it's non-obvious that you can mark an issue as
not planned instead), so we'd probably want to account for that and no-op
if a milestone was removed post-close.
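
In rough Python, a sketch of that scheduled pass (the search query
mirrors the one above; using the "demilestoned" timeline event to
detect an intentionally removed milestone is an assumed heuristic):

    # Sketch of an idempotent scheduled backfill: find issues recently
    # closed as completed that have no milestone, skipping any whose
    # milestone was removed after close.
    import requests

    QUERY = ('repo:apache/beam is:issue is:closed reason:completed '
             'no:milestone closed:>2023-10-01')  # i.e. "the last N days"

    resp = requests.get('https://api.github.com/search/issues',
                        params={'q': QUERY, 'per_page': 100},
                        headers={'Accept': 'application/vnd.github+json'})
    resp.raise_for_status()

    for issue in resp.json()['items']:
        # No-op if someone deliberately removed a milestone post-close:
        # look for a "demilestoned" event newer than closed_at.
        # (ISO-8601 UTC timestamps compare correctly as strings.)
        events = requests.get(issue['events_url']).json()
        if any(e['event'] == 'demilestoned' and
               e['created_at'] > issue['closed_at'] for e in events):
            continue
        print('would assign the current release milestone to',
              issue['html_url'])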

One downside of this approach is that you significantly increase the
chances of an issue getting assigned to the wrong milestone if it
comes in around the release cut; you'd need to either account for this
by checking out the repo to get the version at the time the issue was
closed (expensive/non-trivial) or live with this downside. It's
probably an ok downside to live with.
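
For reference, a sketch of that expensive path, assuming a local clone
with origin/master fetched and that the upcoming version can be read
from gradle.properties at the commit current when the issue closed
(the file path and property name are assumptions):

    # Sketch: recover the Beam version as of an issue's closed_at time
    # by inspecting the repo history.
    import subprocess

    def version_at(closed_at_iso: str, repo_dir: str = 'beam') -> str:
        # Last commit on master before the issue was closed.
        sha = subprocess.check_output(
            ['git', '-C', repo_dir, 'rev-list', '-1',
             f'--before={closed_at_iso}', 'origin/master'],
            text=True).strip()
        # Read gradle.properties as it existed at that commit.
        props = subprocess.check_output(
            ['git', '-C', repo_dir, 'show', f'{sha}:gradle.properties'],
            text=True)
        for line in props.splitlines():
            if line.startswith('version='):
                return line.split('=', 1)[1].removesuffix('-SNAPSHOT')
        raise ValueError('no version property found at ' + sha)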

You could also do a hybrid approach where you run on issue close and run a
scheduled or manual pre-release step to clean up any stragglers. This would
be the most robust option.

On Wed, Oct 25, 2023 at 7:43 AM Kenneth Knowles  wrote:

> Agree. As long as we are getting enough of them, then our records as well
> as any automation depending on it are fine. One easy and standard way to
> make it more resilient would be to make it idempotent instead of counting
> on uptime or receiving any particular event.
>
> Kenn
>
> On Tue, Oct 24, 2023 at 2:58 PM Danny McCormick 
> wrote:
>
>> Looks like for some reason the workflow didn't trigger. This is running
>> on GitHub's hosted runners, so my best guess is an outage.
>>
>> Looking at a more refined query, this year there have been 14 issues that
>> were missed by the automation (3 had their milestone manually removed) -
>> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01 -
>> out of 605 total -
>> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01+ -
>> as best I can tell there were a small number of workflow flakes and then
>> GHA didn't correctly trigger a few.
>>
>> If we wanted, we could set up some recurring automation to go through and
>> try to pick up the ones without milestones (or modify our existing
>> automation to be more tolerant of failures), but it doesn't seem super
>> urgent to me (feel free to disagree). I don't think this piece needs to be
>> perfect.
>>
>> On Tue, Oct 24, 2023 at 2:40 PM Kenneth Knowles  wrote:
>>
>>> Just grabbing one at random for an example,
>>> https://github.com/apache/beam/issues/28635 seems like it was closed as
>>> completed but not tagged.
>>>
>>> I'm happy to see that the bot reads the version from the repo to find
>>> the appropriate milestone, rather than using the nearest open one. Just
>>> recording that for the thread since I first read the description as the
>>> latter.
>>>
>>> Kenn
>>>
>>> On Tue, Oct 24, 2023 at 2:34 PM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 We do tag issues to milestones when the issue is marked as "completed"
 (as opposed to "not planned") -
 https://github.com/apache/beam/blob/master/.github/workflows/assign_milestone.yml.
 So I think using issues is probably about as accurate as using commits.

 > It looks like we have 820 with no milestone
 https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed

 Most predate the automation, though maybe not all? Some of those may
 have been closed as "not planned".

 > This could (should) be automatically discoverable. A (closed) issue
 is associated with commits which are associated with a release.

 Today, we just tag issues to the upcoming milestone when they're
 closed. In theory you could do something more sophisticated using linked
 commits, but in practice people aren't clean enough about linking commits
 to issues. Again, this is fixable by automation/enforcement, but I don't
 think it actually gives us much value beyond what we have today.

 On Tue, Oct 24, 2023 at 1:54 PM Robert Bradshaw via dev <
 dev@beam.apache.org> wrote:

> On Tue, Oct 24, 2023 at 10:35 AM Kenneth Knowles 
> wrote:
>
>> Tangentially related:
>>
>> Long ago, attaching an issue to a release was a mandatory step as
>> part of closing. Now I think it is not. Is it automatically happening? It
>> looks like we have 820 with no milestone
>> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>>
>
> This could (should) be automatically discoverable. A (closed) issue
> is associated with commits which are associated with a release.

Re: [Discuss] Idea to increase RC voting participation

2023-10-25 Thread Kenneth Knowles
Agree. As long as we are getting enough of them, then our records as well
as any automation depending on it are fine. One easy and standard way to
make it more resilient would be to make it idempotent instead of counting
on uptime or receiving any particular event.

Kenn

On Tue, Oct 24, 2023 at 2:58 PM Danny McCormick 
wrote:

> Looks like for some reason the workflow didn't trigger. This is running on
> GitHub's hosted runners, so my best guess is an outage.
>
> Looking at a more refined query, this year there have been 14 issues that
> were missed by the automation (3 had their milestone manually removed) -
> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01 -
> out of 605 total -
> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aclosed+reason%3Acompleted+created%3A%3E2023-01-01+ -
> as best I can tell there were a small number of workflow flakes and then
> GHA didn't correctly trigger a few.
>
> If we wanted, we could set up some recurring automation to go through and
> try to pick up the ones without milestones (or modify our existing
> automation to be more tolerant of failures), but it doesn't seem super
> urgent to me (feel free to disagree). I don't think this piece needs to be
> perfect.
>
> On Tue, Oct 24, 2023 at 2:40 PM Kenneth Knowles  wrote:
>
>> Just grabbing one at random for an example,
>> https://github.com/apache/beam/issues/28635 seems like it was closed as
>> completed but not tagged.
>>
>> I'm happy to see that the bot reads the version from the repo to find the
>> appropriate milestone, rather than using the nearest open one. Just
>> recording that for the thread since I first read the description as the
>> latter.
>>
>> Kenn
>>
>> On Tue, Oct 24, 2023 at 2:34 PM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> We do tag issues to milestones when the issue is marked as "completed"
>>> (as opposed to "not planned") -
>>> https://github.com/apache/beam/blob/master/.github/workflows/assign_milestone.yml.
>>> So I think using issues is probably about as accurate as using commits.
>>>
>>> > It looks like we have 820 with no milestone
>>> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>>>
>>> Most predate the automation, though maybe not all? Some of those may
>>> have been closed as "not planned".
>>>
>>> > This could (should) be automatically discoverable. A (closed) issue
>>> is associated with commits which are associated with a release.
>>>
>>> Today, we just tag issues to the upcoming milestone when they're closed.
>>> In theory you could do something more sophisticated using linked commits,
>>> but in practice people aren't clean enough about linking commits to issues.
>>> Again, this is fixable by automation/enforcement, but I don't think it
>>> actually gives us much value beyond what we have today.
>>>
>>> On Tue, Oct 24, 2023 at 1:54 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
 On Tue, Oct 24, 2023 at 10:35 AM Kenneth Knowles 
 wrote:

> Tangentially related:
>
> Long ago, attaching an issue to a release was a mandatory step as part
> of closing. Now I think it is not. Is it automatically happening? It looks
> like we have 820 with no milestone
> https://github.com/apache/beam/issues?q=is%3Aissue+no%3Amilestone+is%3Aclosed
>

 This could (should) be automatically discoverable. A (closed) issue is
 associated with commits which are associated with a release.


> On Tue, Oct 24, 2023 at 1:25 PM Chamikara Jayalath via dev <
> dev@beam.apache.org> wrote:
>
>> +1 for going by the commits since this is what matters at the end of
>> the day. Also, many issues may not get tagged correctly for a given
>> release, due either to the contributor not tagging the issue or to
>> commits for the issue spanning multiple Beam releases.
>>
>> For example,
>>
>> For all commits in a given release RC:
>>   * If we find a Github issue for the commit: add a notice to the
>> Github issue
>>   * Else: add the notice to a generic issue for the release including
>> tags for the commit ID, PR author, and the committer who merged the PR.
>>
>> Thanks,
>> Cham
>>
>>
>>
>>
>> On Mon, Oct 23, 2023 at 11:49 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> I'd probably vote to include both the issue filer and the
>>> contributor. It is pretty equally straightforward - one way to do this
>>> would be using all issues related to that release's milestone and
>>> extracting the issue author and the issue closer.
>>>
>>> This does leave out the (unfortunately sizable) set of contributions
>>> that don't have an associated issue; if we're worried about that, we 
>>> could
>>> always fall back to anyone with a commit in the last release who 

Beam High Priority Issue Report (46)

2023-10-25 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness 
doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/29076 [Failing Test]: Python ARM 
PostCommit failing after #28385
https://github.com/apache/beam/issues/29022 [Failing Test]: Python Github 
actions tests are failing due to update of pip 
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader 
provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28703 [Failing Test]: Building a wheel 
for integration tests sometimes times out
https://github.com/apache/beam/issues/28383 [Failing Test]: 
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing 
"beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: 
apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be 
leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. 
PeriodicImpulse) running in Flink and polling using tracker.defer_remainder 
have checkpoint size growing indefinitely 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to 
SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularly failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121