Having thought this over a bit, I think there are a few goals and they are
interfering with each other.

1. Clear signal for module / test suite health. This is a post-commit
concern. Post-commit jobs already all run as cron jobs, with no
dependency-driven triggering.
2. Making the pre-commit test signal stay non-flaky as modules, tests, and
flakiness increase.
3. Making pre-commit stay fast as modules, tests, and flakiness increase.

Noting the interdependence of pre-commit and post-commit:

 - you can phrase-trigger post-commit jobs from a pull request
 - pre-commit jobs also run as post-commits

Summarizing a bit:

1. Clear per-module/suite greenness and flakiness signal
 - it would be nice if we could do this at the Gradle job level, but right
now the signal is at the Jenkins job level
 - on the other hand, most Gradle jobs do not represent a module, so that
could be too fine-grained, and Jenkins jobs may be the better unit
 - if we end up with a ton of Jenkins jobs, we need some new automation or
amortized management
 - we don't want to overwhelm the Jenkins executors, especially not in a
way that causes pre-commit queueing

2. Making pre-commit stay non-flaky, robustly
 - we can fix flakes, but we can't count on that long term; we could,
however, build something that forces us to treat flakes as P0
 - we can add a retry budget to tests where deflaking cannot be prioritized
(see the retry sketch after this list)
 - there is a lot of anxiety that testing less in pre-commit will cause
painful post-commit debugging
 - there is a lot of overlap with making pre-commit faster, since the
flakes often come from tests irrelevant to the change

3. Making pre-commit stay fast, robustly
 - we could improve per-worker incremental builds
 - we could use a distributed build cache
 - we have tasks that don't declare their inputs/outputs correctly, which
will cause problems for either approach (see the task sketch after this
list)
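
To make the retry budget idea concrete: Gradle has a test-retry plugin
that can cap retries per test and stop retrying once too many tests fail
in a run. A minimal sketch in the Gradle Kotlin DSL; the plugin version
and the limits below are placeholders, not a proposal for specific
numbers:

    // build.gradle.kts (sketch; version and limits are illustrative)
    plugins {
        java
        id("org.gradle.test-retry") version "1.1.7"
    }

    tasks.test {
        retry {
            // Retry an individual failing test up to 2 times.
            maxRetries.set(2)
            // Stop retrying once more than 10 tests have failed,
            // so a genuinely broken build still fails fast.
            maxFailures.set(10)
            // A test that passes on retry does not fail the build.
            failOnPassedAfterRetry.set(false)
        }
    }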
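
And on the last point under #3: build caching (local or distributed) is
only safe for tasks that declare their inputs and outputs. A sketch of
what a correctly declared, cacheable ad-hoc task looks like in the Gradle
Kotlin DSL; the task and file names are made up for illustration, and
Gradle's default script imports cover the types used here:

    // build.gradle.kts (hypothetical task, for illustration only)
    @CacheableTask
    abstract class GenerateVersionInfo : DefaultTask() {
        // Declared input: a change here invalidates the cache entry.
        @get:Input
        abstract val version: Property<String>

        // Declared output: this is what gets cached and restored.
        @get:OutputFile
        abstract val outputFile: RegularFileProperty

        @TaskAction
        fun generate() {
            outputFile.get().asFile.writeText("version=${version.get()}")
        }
    }

    tasks.register<GenerateVersionInfo>("generateVersionInfo") {
        version.set(project.version.toString())
        outputFile.set(layout.buildDirectory.file("generated/version.properties"))
    }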

I care most about #1 and then also #2. The only reason I care about #3 is
because of #2: once a pre-commit takes more than a couple of minutes, I
always go do something else and come back in an hour or two, so if it
flakes just a few times, it costs a day. Fix #2 and I don't think #3 is
urgent yet.

A distributed build cache seems fairly low effort to set up, improves #2
and #3, and may unlock approaches to #1, provided we can fix our Gradle
configs. We can ask ASF Infra whether they already have something or can
set one up.
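
For reference, wiring up a remote HTTP build cache is mostly a
settings-level change. A minimal sketch in the Gradle Kotlin DSL,
assuming a cache node URL and credentials that don't exist yet (whatever
ASF Infra could host for us); the names below are placeholders:

    // settings.gradle.kts (sketch; URL and credential variables are placeholders)
    buildCache {
        local {
            // Keep the local cache on for individual developers.
            isEnabled = true
        }
        remote<HttpBuildCache> {
            // Hypothetical cache node.
            setUrl("https://beam-build-cache.example.org/cache/")
            // Only CI workers write to the shared cache; everyone can read.
            isPush = System.getenv("CI") != null
            credentials {
                username = System.getenv("BUILD_CACHE_USER")
                password = System.getenv("BUILD_CACHE_PASSWORD")
            }
        }
    }

Keeping developers pull-only means a bad cache entry can only come from
CI, which keeps the trust story simple.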

That still leaves open how to get a better and more visible greenness and
flakiness signal at a more meaningful granularity.

Kenn

On Fri, Jul 10, 2020 at 6:38 AM Kenneth Knowles <k...@apache.org> wrote:

> On Thu, Jul 9, 2020 at 1:44 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> I wonder how hard it would be to track greenness and flakiness at the
>> level of gradle project (or even lower), viewed hierarchically.
>>
>
> Looks like this is part of the Gradle Enterprise Tests Dashboard offering:
> https://gradle.com/blog/flaky-tests/
>
> Kenn
>
>> > Recall my (non-binding) starting point guessing at what tests should
>> > or should not run in some scenarios: (this tangent is just about the
>> > third one, where I explicitly said maybe we run all the same tests and
>> > then we want to focus on separating signals as Luke pointed out)
>> >
>> > > - changing an IO or runner would not trigger the 20 minutes of core
>> > > SDK tests
>> > > - changing a runner would not trigger the long IO local integration
>> > > tests
>> > > - changing the core SDK could potentially not run as many tests in
>> > > presubmit, but maybe it would and they would be separately reported
>> > > results with clear flakiness signal
>> >
>> > And let's consider even more concrete examples:
>> >
>> >  - when changing a Fn API proto, how important is it to run
>> > RabbitMqIOTest?
>> >  - when changing JdbcIO, how important is it to run the Java SDK
>> > needsRunnerTests? RabbitMqIOTest?
>> >  - when changing the FlinkRunner, how important is it to make sure
>> > that Nexmark queries still match their models when run on direct
>> > runner?
>> >
>> > I chose these examples to all have zero value, of course. And I've
>> > deliberately included an example of a core change and a leaf test. Not
>> > all (core change, leaf test) pairs are equally important. The vast
>> > majority of all tests we run are literally unable to be affected by
>> > the changes triggering the test. So that's why enabling Gradle cache
>> > or using a plugin like Brian found could help part of the issue, but
>> > not the whole issue, again as Luke reminded.
>>
>> For (2) and (3), I would hope that the build dependency graph could
>> exclude them. You're right about (1) (and I've hit that countless
>> times), but would rather err on the side of accidentally running too
>> many tests than not enough. If we make manual edits to what can be
>> inferred by the build graph, let's make it a blacklist rather than an
>> allow list to avoid accidental lost coverage.
>>
>> > We make these tradeoffs all the time, of course, via putting some
>> > tests in *IT and postCommit runs and some in *Test, implicitly
>> > preCommit. But I am imagining a future where we can decouple the test
>> > suite definitions (very stable, not depending on the project context)
>> > from the decision of where and when to run them (less stable, changing
>> > as the project changes).
>> >
>> > My assumption is that the project will only grow and all these
>> > problems (flakiness, runtime, false coupling) will continue to get
>> > worse. I raised this now so we could consider what is a steady state
>> > approach that could scale, before it becomes an emergency. I take it
>> > as a given that it is harder to change culture than it is to change
>> > infra/code, so I am not considering any possibility of more attention
>> > to flaky tests or more attention to testing the core properly or more
>> > attention to making tests snappy or more careful consideration of *IT
>> > and *Test. (unless we build infra that forces more attention to these
>> > things)
>> >
>> > Incidentally, SQL is not actually fully factored out. If you edit SQL
>> > it runs a limited subset defined by :sqlPreCommit. If you edit core,
>> > then :javaPreCommit still includes SQL tests.
>>
>> I think running SQL tests when you edit core is not actually that bad.
>> Possibly better than not running any of them. (Maybe, as cost becomes
>> more of a concern, adding the notion of "smoke tests" that are a cheap
>> subset run when upstream projects change would be a good compromise.)
>>
>
