I do like the suggestions offered up by Dale and would encourage (or even
plead) with my fellow contributors to consider these:

  *   Allow code owners to override the block, if they can be convinced the
> override is justified.
>   *   Exclude troublesome tests from stress test runs, either via
> annotations or via an `assumeThat(…)` that can detect that it’s running as
> a stress test. Whatever the mechanism for excluding, it would be in the
> code, and therefore subject to code owner review. (This, too, feels overly
> broad to me, as it would exclude the test from all stress test runs.)
>   *   A way to exclude a specific test method from running in the stress
> tests for a specific PR or commit. I don’t have any ideas for how to
> declare such an exclusion, but if it could be done via a file it would be
> subject to code owner review.


1) Allow code owners to override the block, if they can be convinced the
override is justified.

After all, if we don't trust our code owners...

2-3) Use a custom annotation to exclude the test method or test class only
from stress-new-test.

At first I really liked this idea, but then we end up with growing a
collection of flaky tests that are excluded in some way from
stress-new-test that still occasionally fail in distributedTest.

#1 really sounds like the best option to me. I believe that leaving our
stress-new-test process as-is will only discourage everyone from fixing one
or two flaky tests in a large dunit test. However, I also believe that if
we give code owners the authority to override stress-new-test, then we
need to encourage them not to override this too often.

On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <dem...@vmware.com> wrote:

> Maybe we can find a way to relax the requirement, or to allow addressing
> specific situations like the tangle you find yourself in.
>
> Removing the requirement altogether feels overly broad. I fear it would
> allow us to quietly disregard all intermittent test failures, and I think
> we already quietly (or even actively) disregard way too many kinds of
> failures.
>
> I would prefer some way to explicitly disregard only the specific test
> failures that prevent us from merging, and only with some amount of
> explicit justification.
>
> I’m not sure what that would look like. Some half-baked possibilities:
>
>   *   Allow code owners to override the block, if they can be convinced
> the override is justified.
>   *   Exclude troublesome tests from stress test runs, either via
> annotations or via an `assumeThat(…)` that can detect that it’s running as
> a stress test. Whatever the mechanism for excluding, it would be in the
> code, and therefore subject to code owner review. (This, too, feels overly
> broad to me, as it would exclude the test from all stress test runs.)
>   *   A way to exclude a specific test method from running in the stress
> tests for a specific PR or commit. I don’t have any ideas for how to
> declare such an exclusion, but if it could be done via a file it would be
> subject to code owner review.
>
> Dale
>
> From: Kirk Lund <kl...@apache.org>
> Date: Tuesday, June 8, 2021 at 9:33 AM
> To: dev@geode.apache.org <dev@geode.apache.org>
> Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
> Our requirement for stress-new-test-openjdk11 to pass before allowing merge
> doesn't really work as intended for fixing distributed tests that contain
> multiple flaky test methods. In fact, I think it causes contributors to
> avoid tackling flaky tests.
>
> I've been working on GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103> and was able to fix it.
>
> However, stress-new-test-openjdk11 then continued to fail for other flaky
> tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
> GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
> which remains flaky.
>
> Unfortunately, I cannot merge any of my fixes for
> PutAllClientServerDistributedTest unless every single flaky test in it is
> fixed by my PR. I think this is undesirable because it would be better to
> merge the fix for 3 flaky test methods than none.
>
> UPDATE: After running my precheckin more times, I did get
> stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
> loophole than anything because I didn't manage to fix GEODE-9242.
>
> Despite having PR #6542 eventually pass, I would like to discuss removing
> or relaxing our requirement that stress-new-test-openjdk11 must pass in
> order to merge a PR...
>
> PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
> PutAllClientServerDistributedTest
> <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdemery%40vmware.com%7Ca1f2e89aed9a4cb0940c08d92a9b2ac6%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668190431746%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=CEeIvbWkOY%2B00EEVMUIqcsil%2BYUXLcIiyhSiVUxlD%2Fs%3D&amp;reserved=0
> >
>
> Fixed in PR #6542:
> * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
> testPartialKeyInPRSingleHopWithRedundancy
> <https://issues.apache.org/jira/browse/GEODE-9296>
> * GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103>
> * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
> fails due to missing disk store after server restart
> <https://issues.apache.org/jira/browse/GEODE-8528>
>
> Still flaky:
> * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
> testEventIdOutOfOrderInPartitionRegionSingleHop
> <https://issues.apache.org/jira/browse/GEODE-9242>
>
> Thanks,
> Kirk
>

Reply via email to