I think that there should be room for developers to bypass the stress-new-test 
jobs requirement in cases like this, where the test that's failing is an 
existing flaky test and that failure is blocking other flaky tests/bugs from 
being fixed. Splitting the class up won't save us, because stress-new-test runs 
on any changed or newly created test class, so no matter what you try and do, 
you'll have to run with the remaining flaky test at some point, which will 
block a PR. In Kirk's case, short of re-running the job until you get lucky and 
don't hit the existing flaky failure, there is no way to merge the fix for the 
other flaky tests other than bypassing stress-new-test or @Ignore-ing the 
existing flaky test (and if we're going to allow that approach, we may as well 
just ignore all of them until they're fixed).

We already allow stress-new-test to be skipped in situations where the number 
of modified test classes is large (like the recent change to use curly brackets 
on all control flow statements, which almost certainly would have hit flaky 
failures in pre-checkin if stress-new-test had run despite not changing any 
logic) so there's some precedence for bypassing the job when the value of the 
changes (or the cost of running the job) outweighs the benefits of ensuring 
that new flakiness isn't being introduced. However, I do think that in the 
majority of cases, stress-new-test should run, and should block PRs if a test 
fails, as there is real value being provided by it in terms of preventing new 
flaky tests being added.

I like the idea of allowing code owners to bypass stress-new-test failures, but 
I'm a little unclear on the practicalities of how this would actually work. How 
would this bypassing be achieved? If a PR has multiple code owners spanning 
multiple areas of code, would all of them be able to bypass stress-new-test, or 
just the code owners for the class showing the flaky failure? If the latter, 
what would happen if two classes with different code owners were modified and 
both had existing flaky tests in them? Would a separate override be needed from 
code owners for each class?

I think we need to strike a balance between making it *possible* to override 
stress-new-test failures and making it *easy* to override them, because I don't 
think it should be something that's ever done casually and without careful 
thought. With that in mind, I would suggest that it might be best to bring 
these rare cases to the dev list and make it a community decision (or at least 
not just a code-owner decision), even if that can be a slow and bureaucratic 
process at times.

Also, big +1 to the idea of having a continued, concerted effort to tackle 
flaky tests. Such efforts have been made in the past and were really 
successful, both in terms of reducing overall test flakiness and providing 
valuable insight into how to avoid creating flaky tests in the first place.
________________________________
From: Mark Hanson <hans...@vmware.com>
Sent: Wednesday, June 9, 2021 10:16 AM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

One other thing to think about is perhaps having a rotating team to deal with 
flaky tests,  a small team commissioned every three or 6 months to clear out 
flaky tests for 1 month. It is good experience

Thanks,
Mark

On 6/9/21, 10:04 AM, "Mark Hanson" <hans...@vmware.com> wrote:

    One other thing is that we have code owners. If the PR submitter decides to 
ignore the stress new test, the code owner can still request a fix. That is 
probably a sufficient process.

    On 6/9/21, 10:02 AM, "Mark Hanson" <hans...@vmware.com> wrote:

        I agree, I am willing to concede this discussion, as long as we are 
judicious and specifically only use it when it is not our test that is failing. 
It is a real problem.

        Thanks,
        Mark
        On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote:

            I did think about splitting up dunit tests, but I believe
            testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky 
even if I
            move it to a new dunit test. No matter how you dice it up, we end 
up with a
            PR that cannot be merged to develop unless you get lucky after 
running
            stress-new-test many times.

            One could try being tricky by marking it with @ignore or deleting 
the flaky
            test in one PR, and then re-add it in a 2nd PR. But, even then that 
2nd PR
            is very unlikely to pass stress-new-test unless someone fixes the 
cause of
            that test's flakiness.

            As it stands, stress-new-test just ends up being a dead-end or an 
endless
            time-sync for fixing multiple flaky tests in one dunit.

            On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <dasm...@vmware.com> 
wrote:

            > Would it be possible to just split that test up into multiple 
classes? It
            > sounds like the issue is that there is so many flaky tests in 
that class
            > that you can't fix them all in one PR, which might indicate it's 
too big.
            >
            > If we can't get StressNewTest to pass - that means our builds are 
failing
            > more than 2% of the time due to this one test failure. Yikes!
            >
            > -Dan
            > ________________________________
            > From: Kirk Lund <kl...@apache.org>
            > Sent: Tuesday, June 8, 2021 9:33 AM
            > To: dev@geode.apache.org <dev@geode.apache.org>
            > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement 
from PRs
            >
            > Our requirement for stress-new-test-openjdk11 to pass before 
allowing merge
            > doesn't really work as intended for fixing distributed tests that 
contain
            > multiple flaky test methods. In fact, I think it causes 
contributors to
            > avoid tackling flaky tests.
            >
            > I've been working on GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions 
FAILED
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067757752%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=DBtwYMtKxKq7j1HZJLtkk4mE7PR7aIkuEPelAJzdb7w%3D&amp;reserved=0>
            > and was able to fix it.
            >
            > However, stress-new-test-openjdk11 then continued to fail for 
other flaky
            > tests in PutAllClientServerDistributedTest. I managed to fix 
GEODE-9296 and
            > GEODE-8528 as well. I also tried but have not been able to fix 
GEODE-9242
            > which remains flaky.
            >
            > Unfortunately, I cannot merge any of my fixes for
            > PutAllClientServerDistributedTest unless every single flaky test 
in it is
            > fixed by my PR. I think this is undesirable because it would be 
better to
            > merge the fix for 3 flaky test methods than none.
            >
            > UPDATE: After running my precheckin more times, I did get
            > stress-new-test-openjdk11 to pass once so I can merge, but that's 
more of a
            > loophole than anything because I didn't manage to fix GEODE-9242.
            >
            > Despite having PR #6542 eventually pass, I would like to discuss 
removing
            > or relaxing our requirement that stress-new-test-openjdk11 must 
pass in
            > order to merge a PR...
            >
            > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
            > PutAllClientServerDistributedTest
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067757752%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=e%2Bi8qHJJ55q8EjoEVDdjlSx0nF5MEsaIzybLFIOdbbA%3D&amp;reserved=0
            > >
            >
            > Fixed in PR #6542:
            > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
            > testPartialKeyInPRSingleHopWithRedundancy
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067757752%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=O06rBnGzNbBobYGLbnBrbCv%2FhqRUyYzvAkod6XKjjEM%3D&amp;reserved=0
            > >
            > * GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions 
FAILED
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067767717%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=zzuLonsCrI5T5PgFdp1a0bC2GzBk4Va17Ga5uRWWdsQ%3D&amp;reserved=0
            > >
            > * GEODE-8528: 
PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
            > fails due to missing disk store after server restart
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067767717%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=EX0QSFrj%2FvZXhGY8EuUEJUG5IMQH639C1ATfYF86mlo%3D&amp;reserved=0
            > >
            >
            > Still flaky:
            > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
            > testEventIdOutOfOrderInPartitionRegionSingleHop
            > <
            > 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067767717%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=b%2B8g00GsW8H8RxmgWrvgDZU6HsOPzTGLZCahhCvhhsI%3D&amp;reserved=0
            > >
            >
            > Thanks,
            > Kirk
            >



Reply via email to