I think that there should be room for developers to bypass the stress-new-test jobs requirement in cases like this, where the test that's failing is an existing flaky test and that failure is blocking other flaky tests/bugs from being fixed. Splitting the class up won't save us, because stress-new-test runs on any changed or newly created test class, so no matter what you try and do, you'll have to run with the remaining flaky test at some point, which will block a PR. In Kirk's case, short of re-running the job until you get lucky and don't hit the existing flaky failure, there is no way to merge the fix for the other flaky tests other than bypassing stress-new-test or @Ignore-ing the existing flaky test (and if we're going to allow that approach, we may as well just ignore all of them until they're fixed).
We already allow stress-new-test to be skipped in situations where the number of modified test classes is large (like the recent change to use curly brackets on all control flow statements, which almost certainly would have hit flaky failures in pre-checkin if stress-new-test had run despite not changing any logic) so there's some precedence for bypassing the job when the value of the changes (or the cost of running the job) outweighs the benefits of ensuring that new flakiness isn't being introduced. However, I do think that in the majority of cases, stress-new-test should run, and should block PRs if a test fails, as there is real value being provided by it in terms of preventing new flaky tests being added. I like the idea of allowing code owners to bypass stress-new-test failures, but I'm a little unclear on the practicalities of how this would actually work. How would this bypassing be achieved? If a PR has multiple code owners spanning multiple areas of code, would all of them be able to bypass stress-new-test, or just the code owners for the class showing the flaky failure? If the latter, what would happen if two classes with different code owners were modified and both had existing flaky tests in them? Would a separate override be needed from code owners for each class? I think we need to strike a balance between making it *possible* to override stress-new-test failures and making it *easy* to override them, because I don't think it should be something that's ever done casually and without careful thought. With that in mind, I would suggest that it might be best to bring these rare cases to the dev list and make it a community decision (or at least not just a code-owner decision), even if that can be a slow and bureaucratic process at times. Also, big +1 to the idea of having a continued, concerted effort to tackle flaky tests. Such efforts have been made in the past and were really successful, both in terms of reducing overall test flakiness and providing valuable insight into how to avoid creating flaky tests in the first place. ________________________________ From: Mark Hanson <hans...@vmware.com> Sent: Wednesday, June 9, 2021 10:16 AM To: dev@geode.apache.org <dev@geode.apache.org> Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs One other thing to think about is perhaps having a rotating team to deal with flaky tests, a small team commissioned every three or 6 months to clear out flaky tests for 1 month. It is good experience Thanks, Mark On 6/9/21, 10:04 AM, "Mark Hanson" <hans...@vmware.com> wrote: One other thing is that we have code owners. If the PR submitter decides to ignore the stress new test, the code owner can still request a fix. That is probably a sufficient process. On 6/9/21, 10:02 AM, "Mark Hanson" <hans...@vmware.com> wrote: I agree, I am willing to concede this discussion, as long as we are judicious and specifically only use it when it is not our test that is failing. It is a real problem. Thanks, Mark On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote: I did think about splitting up dunit tests, but I believe testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even if I move it to a new dunit test. No matter how you dice it up, we end up with a PR that cannot be merged to develop unless you get lucky after running stress-new-test many times. One could try being tricky by marking it with @ignore or deleting the flaky test in one PR, and then re-add it in a 2nd PR. But, even then that 2nd PR is very unlikely to pass stress-new-test unless someone fixes the cause of that test's flakiness. As it stands, stress-new-test just ends up being a dead-end or an endless time-sync for fixing multiple flaky tests in one dunit. On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <dasm...@vmware.com> wrote: > Would it be possible to just split that test up into multiple classes? It > sounds like the issue is that there is so many flaky tests in that class > that you can't fix them all in one PR, which might indicate it's too big. > > If we can't get StressNewTest to pass - that means our builds are failing > more than 2% of the time due to this one test failure. Yikes! > > -Dan > ________________________________ > From: Kirk Lund <kl...@apache.org> > Sent: Tuesday, June 8, 2021 9:33 AM > To: dev@geode.apache.org <dev@geode.apache.org> > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs > > Our requirement for stress-new-test-openjdk11 to pass before allowing merge > doesn't really work as intended for fixing distributed tests that contain > multiple flaky test methods. In fact, I think it causes contributors to > avoid tackling flaky tests. > > I've been working on GEODE-9103: CI Failure: > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067757752%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=DBtwYMtKxKq7j1HZJLtkk4mE7PR7aIkuEPelAJzdb7w%3D&reserved=0> > and was able to fix it. > > However, stress-new-test-openjdk11 then continued to fail for other flaky > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242 > which remains flaky. > > Unfortunately, I cannot merge any of my fixes for > PutAllClientServerDistributedTest unless every single flaky test in it is > fixed by my PR. I think this is undesirable because it would be better to > merge the fix for 3 flaky test methods than none. > > UPDATE: After running my precheckin more times, I did get > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a > loophole than anything because I didn't manage to fix GEODE-9242. > > Despite having PR #6542 eventually pass, I would like to discuss removing > or relaxing our requirement that stress-new-test-openjdk11 must pass in > order to merge a PR... > > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in > PutAllClientServerDistributedTest > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067757752%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=e%2Bi8qHJJ55q8EjoEVDdjlSx0nF5MEsaIzybLFIOdbbA%3D&reserved=0 > > > > Fixed in PR #6542: > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest > > testPartialKeyInPRSingleHopWithRedundancy > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067757752%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=O06rBnGzNbBobYGLbnBrbCv%2FhqRUyYzvAkod6XKjjEM%3D&reserved=0 > > > * GEODE-9103: CI Failure: > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067767717%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=zzuLonsCrI5T5PgFdp1a0bC2GzBk4Va17Ga5uRWWdsQ%3D&reserved=0 > > > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop > fails due to missing disk store after server restart > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067767717%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=EX0QSFrj%2FvZXhGY8EuUEJUG5IMQH639C1ATfYF86mlo%3D&reserved=0 > > > > Still flaky: > * GEODE-9242: CI failure in PutAllClientServerDistributedTest > > testEventIdOutOfOrderInPartitionRegionSingleHop > < > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067767717%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=b%2B8g00GsW8H8RxmgWrvgDZU6HsOPzTGLZCahhCvhhsI%3D&reserved=0 > > > > Thanks, > Kirk >