Hi,

I created https://issues.apache.org/jira/browse/CASSANDRA-14241 for this issue.
You are right, there is a solid chunk of tests failing on Apache infrastructure
that don't fail on CircleCI. I'll find someone to get it done.

I think fix-before-commit is only going to happen if we go all the way and
route every single commit through testing infrastructure that runs all the
tests multiple times and refuses to merge commits unless the tests pass
somewhat consistently. Short of that, flaky (and hard failing) tests are going
to keep creeping in (and even then they will to some degree). That's not
feasible without much better infrastructure available to everyone, and it's
not a short-term thing right now, I think. Maybe we move forward with it on
the Apache infrastructure we have.

I'm not sure flaky infrastructure is what is acutely hurting us, although we
do have infrastructure that exposes unreliable tests; maybe that's just a
matter of framing.

Dealing with flaky tests generally devolves into picking victim(s) via some
process. Blocking releases on failing tests picks the people who want the next
release as victims. Blocking commits on flaky tests makes the people who want
to merge things the victims. Doing nothing makes victims of the random subset
of volunteers who end up fixing the tests, of all the developers who run them,
and, to a certain extent, of end users. Excluding tests or re-running them
multiple times picks the end users of releases as the victims.

RE multi-pronged: we are currently using a flaky annotation that reruns tests,
we have skipped tests with JIRAs, and we re-run tests if they fail for certain
classes of reasons. So we are already down that road. I think it's fine, but
we need a backpressure mechanism because we can't keep accruing this kind of
thing forever.
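
To make the rerun mechanism concrete, here's a rough sketch of the kind of
retry rule I mean, in JUnit 4 terms. The class name and retry count are made
up for illustration; this is not the exact annotation we have in tree.

    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    // Hypothetical sketch: rerun a test a few times and only report failure
    // if every attempt fails. Not the actual in-tree flaky annotation.
    public class RetryFlakyRule implements TestRule
    {
        private final int maxAttempts;

        public RetryFlakyRule(int maxAttempts)
        {
            this.maxAttempts = maxAttempts;
        }

        @Override
        public Statement apply(Statement base, Description description)
        {
            return new Statement()
            {
                @Override
                public void evaluate() throws Throwable
                {
                    Throwable last = null;
                    for (int attempt = 1; attempt <= maxAttempts; attempt++)
                    {
                        try
                        {
                            base.evaluate(); // run the test body
                            return;          // first success wins
                        }
                        catch (Throwable t)
                        {
                            last = t;
                            System.err.println(description.getDisplayName() +
                                " failed attempt " + attempt + "/" + maxAttempts);
                        }
                    }
                    throw last; // every attempt failed, surface the last error
                }
            };
        }
    }

A test class would then declare something like
@Rule public RetryFlakyRule retry = new RetryFlakyRule(3); and every test in
it gets up to three attempts. The problem is exactly the one above: nothing
pushes back on how many tests end up annotated this way.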

In my mind, processes for keeping the tests passing need to provide two
functions: pick victim(s) (task management) and create backpressure (slow new
development to match the defect rate). It seems possible to create
backpressure by blocking releases, but that fails to pick victims to an
extent. Many people running C* are so far behind they aren't waiting on the
next release, or they are accustomed to running a private fork and
backporting. When we were able to block commits via informal process I think
it helped, but an informal process has limitations.

I think blocking commits via automation is going to spread the load out most
evenly and make it a priority for everyone in the contributor base. We have 16
Apache nodes to work with, which I think would handle our current commit load.
We can fine-tune the criteria for blocking commits as we go.

I don't have an answer for how we apply backpressure to the use of flaky
annotations and test re-runs. Maybe it's a czar saying no commits until we
reach some goal, set on a regular period (every 3 months). Maybe we vote on it
periodically. Czars can be really effective in moving the herd, but the czar
does need to be able to wield something that motivates some set of
contributors to do the work. It's not so much about preventing the commits as
it is about signaling unambiguously that this is what we are working on now,
and if you aren't, you are working on the wrong thing. It ends up being quite
depressing when you work through significant amounts of tech debt all at once,
though; it hurts less when a lot of people are working on it.

Ariel

On Thu, Feb 15, 2018, at 6:48 PM, kurt greaves wrote:
> It seems there has been a bit of a slip in testing as of recently, mostly
> due to the fact that there's no canonical testing environment that isn't
> flaky. We probably need to come up with some ideas and a plan on how we're
> going to do testing in the future, and how we're going to make testing
> accessible for all contributors. I think this is the only way we're really
> going to change behaviour. Having an incredibly tedious process and then
> being aggressive about it only leads to resentment and workarounds.
> 
> I'm completely unsure of where dtests are at since the conversion to
> pytest, and there's a lot of failing dtests on the ASF jenkins jobs (which
> appear to be running pytest). As there's currently not a lot of visibility
> into what people are doing with CircleCI for this it's hard to say if
> things are better over there. I'd like to help here if anyone wants to fill
> me in.
> 
> On 15 February 2018 at 21:14, Josh McKenzie <jmcken...@apache.org> wrote:
> 
> > >
> > > We’ve said in the past that we don’t release without green tests. The PMC
> > > gets to vote and enforce it. If you don’t vote yes without seeing the
> > test
> > > results, that enforces it.
> >
> > I think this is noble and ideal in theory. In practice, the tests take long
> > enough, hardware infra has proven flaky enough, and the tests *themselves*
> > flaky enough, that there's been a consistent low-level of test failure
> > noise that makes separating signal from noise in this context very time
> > consuming. Reference 3.11-test-all for example re:noise:
> > https://builds.apache.org/view/A-D/view/Cassandra/job/
> > Cassandra-3.11-test-all/test/?width=1024&height=768
> >
> > Having spearheaded burning test failures to 0 multiple times and have them
> > regress over time, my gut intuition is we should have one person as our
> > Source of Truth with a less-flaky source for release-vetting CI (dedicated
> > hardware, circle account, etc) we can use as a reference to vote on release
> > SHA's.
> >
> > We’ve declared this a requirement multiple times
> >
> > Declaring things != changed behavior, and thus != changed culture. The
> > culture on this project is one of having a constant low level of test
> > failure noise in our CI as a product of our working processes. Unless we
> > change those (actually block release w/out green board, actually
> > aggressively block merge w/any failing tests, aggressively retroactively
> > track down test failures on a daily basis and RCA), the situation won't
> > improve. Given that this is a volunteer organization / project, that kind
> > of daily time investment is a big ask.
> >
> > On Thu, Feb 15, 2018 at 1:10 PM, Jeff Jirsa <jji...@gmail.com> wrote:
> >
> > > Moving this to it’s own thread:
> > >
> > > We’ve declared this a requirement multiple times and then we occasionally
> > > get a critical issue and have to decide whether it’s worth the delay. I
> > > assume Jason’s earlier -1 on attempt 1 was an enforcement of that earlier
> > > stated goal.
> > >
> > > It’s up to the PMC. We’ve said in the past that we don’t release without
> > > green tests. The PMC gets to vote and enforce it. If you don’t vote yes
> > > without seeing the test results, that enforces it.
> > >
> > > --
> > > Jeff Jirsa
> > >
> > >
> > > > On Feb 15, 2018, at 9:49 AM, Josh McKenzie <jmcken...@apache.org>
> > wrote:
> > > >
> > > > What would it take for us to get green utest/dtests as a blocking part
> > of
> > > > the release process? i.e. "for any given SHA, here's a link to the
> > tests
> > > > that passed" in the release vote email?
> > > >
> > > > That being said, +1.
> > > >
> > > >> On Wed, Feb 14, 2018 at 4:33 PM, Nate McCall <zznat...@gmail.com>
> > > wrote:
> > > >>
> > > >> +1
> > > >>
> > > >> On Thu, Feb 15, 2018 at 9:40 AM, Michael Shuler <
> > mich...@pbandjelly.org
> > > >
> > > >> wrote:
> > > >>> I propose the following artifacts for release as 3.0.16.
> > > >>>
> > > >>> sha1: 890f319142ddd3cf2692ff45ff28e71001365e96
> > > >>> Git:
> > > >>> http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=
> > > >> shortlog;h=refs/tags/3.0.16-tentative
> > > >>> Artifacts:
> > > >>> https://repository.apache.org/content/repositories/
> > > >> orgapachecassandra-1157/org/apache/cassandra/apache-cassandra/3.0.16/
> > > >>> Staging repository:
> > > >>> https://repository.apache.org/content/repositories/
> > > >> orgapachecassandra-1157/
> > > >>>
> > > >>> Debian and RPM packages are available here:
> > > >>> http://people.apache.org/~mshuler
> > > >>>
> > > >>> *** This release addresses an important fix for CASSANDRA-14092 ***
> > > >>>    "Max ttl of 20 years will overflow localDeletionTime"
> > > >>>    https://issues.apache.org/jira/browse/CASSANDRA-14092
> > > >>>
> > > >>> The vote will be open for 72 hours (longer if needed).
> > > >>>
> > > >>> [1]: (CHANGES.txt) https://goo.gl/rLj59Z
> > > >>> [2]: (NEWS.txt) https://goo.gl/EkrT4G
> > > >>>
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org