[
https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346130#comment-17346130
]
Andres de la Peña commented on CASSANDRA-16668:
-----------------------------------------------
Here are 10K runs of {{SEPExecutorTest}} with the patch, using the [CircleCI multiplexer|https://github.com/apache/cassandra/blob/trunk/doc/source/development/testing.rst#circleci]:
* [j8-j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/457/workflows/cb5b1d27-75d4-4b3a-814c-04454fb4f4ef/jobs/4017]
* [j8-j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/457/workflows/cb5b1d27-75d4-4b3a-814c-04454fb4f4ef/jobs/4015]
* [j11-j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/457/workflows/1b82f571-9dd8-4a98-892b-1c0e3704f2d9/jobs/4013]
It seems that {{changingMaxWorkersMeetsConcurrencyGoalsTest}} happily survives all three 10K runs, but there are some uncommon failures of {{shutdownTest}}:
* [j8-j8 runner 62 iteration 49|https://4017-85817267-gh.circle-artifacts.com/62/stdout/fails/049/testsome-org.apache.cassandra.concurrent.SEPExecutorTest.txt]
* [j8-j11 runner 73 iteration 9|https://4015-85817267-gh.circle-artifacts.com/73/stdout/fails/009/testsome-org.apache.cassandra.concurrent.SEPExecutorTest.txt]
* [j11-j11 runner 68 iteration 51|https://4013-85817267-gh.circle-artifacts.com/68/stdout/fails/051/testsome-org.apache.cassandra.concurrent.SEPExecutorTest.txt]
I'm not sure whether those failures are related to this patch or are an independent issue.
> Intermittent failure of
> SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race
> condition when shrinking maximum pool size to zero
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-16668
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16668
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Other
> Reporter: Matt Fleming
> Assignee: Matt Fleming
> Priority: Normal
> Fix For: 4.0-rc
>
>
> A difficult-to-hit race condition exists in
> changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool
> size from 0 -> 4, which causes the test to fail like so:
> {{junit.framework.AssertionFailedError: Test tasks did not hit max concurrency goal expected:<true> but was:<false>}}
> {{    at org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198)}}
> {{    at org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}}
> I can hit this issue maybe 2 or 3 times for every 100 invocations of the unit
> test.
> The failure happens when tasks are still enqueued at the moment the maximum
> pool size is set to zero: if all of the SEPWorker threads enter the STOP state
> before the pool size is bumped back up to 4, then no SEPWorker threads are
> ever spun up to service the task queue, and the test fails with the error
> above.
> Why don't we spin up SEPWorker threads when enqueuing tasks? Because of the
> guard logic in addTask:
> [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121]
> In this scenario taskPermits will not be zero (because we already have tasks
> on the queue), so we never call {{maybeStartSpinningWorker()}}.
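> To make the interleaving concrete, here is a minimal stand-alone sketch. It is
> not the real {{SEPExecutor}}: the class name and the three counters are
> hypothetical stand-ins for the worker pool, the task queue and the
> {{taskPermits}} guard, just to show how the queued work ends up stranded.
> {code:java}
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Toy model of the race; counters stand in for SEPWorkers, the task queue and taskPermits.
> public class ShrinkToZeroRace
> {
>     static final AtomicInteger maxWorkers = new AtomicInteger(4);
>     static final AtomicInteger activeWorkers = new AtomicInteger(1); // one SEPWorker still around
>     static final AtomicInteger queuedTasks = new AtomicInteger(1);   // work already enqueued
>
>     // Mirrors the addTask guard: a worker is only started when the queue *was* empty.
>     static void addTask()
>     {
>         int tasksBefore = queuedTasks.getAndIncrement();
>         if (tasksBefore == 0)
>             activeWorkers.incrementAndGet();   // stand-in for maybeStartSpinningWorker()
>     }
>
>     public static void main(String[] args)
>     {
>         maxWorkers.set(0);      // the test shrinks the pool to zero with work still queued
>         activeWorkers.set(0);   // ...and every SEPWorker reaches the STOP state
>         maxWorkers.set(4);      // the pool is grown back to 4; nothing restarts a worker
>         addTask();              // the guard sees queuedTasks != 0, so no worker is started
>         System.out.println("workers=" + activeWorkers + ", queued=" + queuedTasks);
>         // prints "workers=0, queued=2": the queued tasks are stranded and the concurrency goal fails
>     }
> }
> {code}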
> A trick to make this issue much easier to hit is to insert a
> {{Thread.sleep(500)}} immediately after setting the pool size to zero. This
> guarantees that all SEPWorker threads have entered STOP before more work is
> enqueued.
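> To see why the sleep makes the ordering deterministic, here is a toy
> stand-alone illustration (again not the actual test; the thread and the
> counter are placeholders for an SEPWorker and the pool size): by the time the
> 500 ms pause ends, the worker has certainly observed the shrink and exited,
> and growing the pool afterwards does not bring it back.
> {code:java}
> import java.util.concurrent.atomic.AtomicInteger;
>
> public class SleepTrickDemo
> {
>     public static void main(String[] args) throws InterruptedException
>     {
>         AtomicInteger maxWorkers = new AtomicInteger(1);
>
>         // Placeholder for an SEPWorker: it runs until it sees the pool shrunk to zero.
>         Thread worker = new Thread(() -> {
>             while (maxWorkers.get() > 0)
>                 Thread.yield();
>             // falling out of the loop models the worker entering STOP
>         });
>         worker.start();
>
>         maxWorkers.set(0);    // shrink the pool to zero
>         Thread.sleep(500);    // the "trick": guarantees the worker has stopped by now
>         maxWorkers.set(4);    // grow the pool again; nothing restarts the worker
>
>         worker.join(1000);
>         System.out.println("worker alive after regrow: " + worker.isAlive());   // false
>     }
> }
> {code}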
> Here's a fix that attempts to spin up an SEPWorker whenever we grow the
> number of work permits:
> https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01
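> Sketched on the same kind of toy model (a hypothetical paraphrase of the
> approach, not the linked commit): growing the pool is treated like enqueueing
> the first task, so any work stranded while the pool was at zero gets a worker
> started for it.
> {code:java}
> import java.util.concurrent.atomic.AtomicInteger;
>
> public class GrowPoolFixSketch
> {
>     final AtomicInteger maxWorkers = new AtomicInteger(0);
>     final AtomicInteger activeWorkers = new AtomicInteger(0);
>     final AtomicInteger queuedTasks = new AtomicInteger(2);   // work stranded while the pool was zero
>
>     void setMaxWorkers(int newMax)
>     {
>         int oldMax = maxWorkers.getAndSet(newMax);
>         // The gist of the fix: when the pool grows, (re)start a worker if there is
>         // already work waiting, instead of relying only on the addTask guard.
>         if (newMax > oldMax && queuedTasks.get() > 0 && activeWorkers.get() < newMax)
>             activeWorkers.incrementAndGet();                  // stand-in for maybeStartSpinningWorker()
>     }
>
>     public static void main(String[] args)
>     {
>         GrowPoolFixSketch pool = new GrowPoolFixSketch();
>         pool.setMaxWorkers(4);                                // 0 -> 4 with tasks still queued
>         System.out.println("workers=" + pool.activeWorkers);  // prints 1: the stranded work now gets serviced
>     }
> }
> {code}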