The fix was inadvertently run in dry-run mode, so it didn't make any
changes. Since the fix was taking a couple of hours or so and it
was getting late on Friday, people didn't want to start it again
until today (after the weekend).
I don't think removing the few tests that run an unbounded
pipeline on Dataflow for the long term is a good idea. Sure, we can
disable them when an issue is blocking folks and re-enable them
once it is fixed.
On Mon, Aug 16, 2021 at 9:19 AM Andrew Pilloud
<[email protected] <mailto:[email protected]>> wrote:
The estimated two hours to a fix has long passed and we are now
at 18 days since the last successful run. What is the latest
estimate?
It sounds like these tests are primarily testing
Dataflow, not Beam. They seem like good candidates to remove
from the precommit (or limit to Dataflow runner changes) even
after they are fixed.
On Fri, Aug 13, 2021 at 6:48 PM Luke Cwik <[email protected]> wrote:
The failure is related to data associated with the
apache-beam-testing project, which is impacting
all the Dataflow streaming tests.
Yes, disabling the tests should have happened weeks ago if:
1) The fix seemed like it was going to take a long time
(which was unknown at the time)
2) We had confidence in test coverage minus Dataflow
streaming test coverage (which I believe we did)
On Fri, Aug 13, 2021 at 6:27 PM Andrew Pilloud
<[email protected] <mailto:[email protected]>> wrote:
Or if a rollback won't fix this, can we disable the
broken tests?
On Fri, Aug 13, 2021 at 6:25 PM Andrew Pilloud
<[email protected] <mailto:[email protected]>> wrote:
So you can roll back in two hours. Beam has been
broken for two weeks. Why isn't a rollback
appropriate?
On Fri, Aug 13, 2021 at 6:06 PM Luke Cwik
<[email protected] <mailto:[email protected]>> wrote:
The test failures that I have seen have been because of
BEAM-12676 [1], which is due to a bug impacting Dataflow
streaming pipelines for the apache-beam-testing project.
My understanding is that the fix is rolling out now and
should take another 2 hrs or so. Rolling back master
doesn't seem like what we should be doing at the moment.
1: https://issues.apache.org/jira/projects/BEAM/issues/BEAM-12676
On Fri, Aug 13, 2021 at 5:51 PM Andrew Pilloud <[email protected]> wrote:
Both the Java and Python precommits are reporting the
last successful run as being in July (for both the Cron
and Commit jobs), so it looks like changes are being
submitted without successful test runs.
We probably shouldn't be doing that?
https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/
https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/
https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Cron/
https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Commit/
Is there a plan to get this fixed? Should
we roll master back to July?
On Tue, Aug 3, 2021 at 12:24 PM Tyson Hamilton <[email protected]> wrote:
I only realized after sending that I used the IP
address in the link; that was by accident. Here is
the proper domain link:
http://metrics.beam.apache.org/d/D81lW0pmk/post-commit-test-reliability?orgId=1
On Tue, Aug 3, 2021 at 3:22 PM Tyson Hamilton <[email protected]> wrote:
The way I've investigated precommit flake stability
is by looking at the 'Post-commit Test Reliability'
[1] dashboard (hah!). There is a cron job that runs
the precommits, and those results are, confusingly,
tracked in the post-commit dashboard. This week,
Java is about 50% green for the pre-commit cron job,
which is not great.
The plugin we installed for tracking the most flaky
tests in a job doesn't cope well with the number of
tests present in the precommit cron job. This could
be an area of improvement, to add granularity and
visibility into the flakiest tests over some period
of time.
[1]: http://104.154.241.245/d/D81lW0pmk/post-commit-test-reliability?orgId=1
(look for "PreCommit_Java_Cron")
On Tue, Aug 3, 2021 at 2:24 PM Andrew Pilloud <[email protected]> wrote:
Our metrics show that Java is nearly free from
flakes, that Go has significant flakes, and that
Python is effectively broken. It appears the
metrics may be missing coverage on the Java side.
The dashboard is here:
http://104.154.241.245/d/McTAiu0ik/stability-critical-jobs-status?orgId=1
I agree that this is important to address. I
haven't submitted any code recently, but I spent
a significant amount of time on the 2.31.0
release investigating flakes in the release
validation tests.
Andrew
On Tue, Aug 3, 2021 at 10:43 AM Reuven Lax <[email protected]> wrote:
I've noticed recently that our precommit tests
are getting flakier and flakier. Recently I had
to run the Java PreCommit five times before I
was able to get a clean run. This is frustrating
for us as developers, but it is also extremely
wasteful of our compute resources.
I started making a list of the flaky tests I've
seen. Here are some of the ones I've dealt with
in just the past few days; this is by no means an
exhaustive list, as I saw many others before I
started recording them. Of the below, failures in
ElasticsearchIOTest are by far the most common!
We need to try to make these tests not flaky.
Barring that, I think the extremely flaky tests
need to be excluded from our presubmit until they
can be fixed. Rerunning the precommit over and
over again until it goes green is not a good
testing strategy.
* org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: false]
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3901/testReport/junit/org.apache.beam.runners.flink/ReadSourcePortableTest/testExecution_streaming__false_/>
* org.apache.beam.sdk.io.jms.JmsIOTest.testCheckpointMarkSafety
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/18485/testReport/junit/org.apache.beam.sdk.io.jms/JmsIOTest/testCheckpointMarkSafety/>
* org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInFinishBundleStateful
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3903/testReport/junit/org.apache.beam.sdk.transforms/ParDoLifecycleTest/testTeardownCalledAfterExceptionInFinishBundleStateful/>
* org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testSplit
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3903/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSplit/>
* org.apache.beam.sdk.io.gcp.datastore.RampupThrottlingFnTest.testRampupThrottler
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/18501/testReport/junit/org.apache.beam.sdk.io.gcp.datastore/RampupThrottlingFnTest/testRampupThrottler/>
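For what it's worth, one low-cost way to exclude a suite like this is a Gradle test filter. This is only a sketch, assuming the precommit runs the module's standard `test` task; the class name is taken from the list above, and the placement in the module's build.gradle is an assumption:

```groovy
// build.gradle sketch (assumption: precommit runs this module's `test` task).
// Temporarily skip a known-flaky suite; delete the filter to re-enable it
// once the tracking JIRA issue is fixed.
test {
    filter {
        excludeTestsMatching 'org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest'
    }
}
```

Alternatively, annotating the individual flaky methods with JUnit's @Ignore (with the JIRA issue in the reason string) keeps the exclusion visible right next to the test itself.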