[
https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347196#comment-17347196
]
Jon Meredith commented on CASSANDRA-16668:
------------------------------------------
+1 from me on 071516d29e41da9924af24e8002822d3c6af0e01 and
b4f43608c9a8db23a622608804d95629616a66da. Thanks so much for fixing it and
updating the tests.
I think it fixes the issue you found, and I don't think it is possible to
create an unbounded number of threads in the shared pool by frequently calling
{{setMaximumPoolSize}}, since it only calls schedule when there are no spinning
threads and tries to satisfy the demand for work from the parked/spinning
workers first.
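For anyone else following along, this is the shape of the guard I mean -- a
minimal sketch with assumed names, not the actual SharedExecutorPool/SEPExecutor
code:
{code:java}
// Minimal sketch, not the actual Cassandra source: assumed names illustrating why
// frequent setMaximumPoolSize calls cannot pile up threads -- a new spinning worker
// is only scheduled when nothing is already spinning.
import java.util.concurrent.atomic.AtomicInteger;

class SpinningGuardSketch
{
    private final AtomicInteger spinningCount = new AtomicInteger(0);

    void maybeStartSpinningWorker()
    {
        // only the 0 -> 1 transition schedules a new spinner; an already-spinning
        // worker (or a parked one that gets woken first) picks the work up itself
        if (spinningCount.get() == 0 && spinningCount.compareAndSet(0, 1))
            startWorker();
    }

    private void startWorker()
    {
        // stand-in for scheduling a worker via the shared pool; details omitted
        new Thread(spinningCount::decrementAndGet).start();
    }
}
{code}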
I'm slightly concerned about the fixed-duration sleep() in the new test given
how much variability there is in CI infrastructure timing. Ideally the test
would verify that all of the SEPWorkers were parked (and none spinning), but I
don't see any straightforward way to do that without refactoring. Worst case,
the test doesn't cover the all-SEPWorkers-parked case on every run, but that
seems unlikely to fail and create a new flaky test.
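If we ever do that refactoring, I'd replace the fixed sleep with a
deadline-based wait along these lines -- purely a sketch, and the
allWorkersParked check is hypothetical, nothing exposes it today:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Sketch only: poll a hypothetical "all workers parked" condition with a deadline
// instead of assuming a single fixed sleep is long enough on busy CI hardware.
final class AwaitParkedSketch
{
    static void awaitAllParked(BooleanSupplier allWorkersParked, long timeoutMillis) throws InterruptedException
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!allWorkersParked.getAsBoolean())
        {
            if (System.nanoTime() - deadline > 0)
                throw new AssertionError("SEPWorkers did not all park within " + timeoutMillis + "ms");
            Thread.sleep(10); // re-check periodically rather than rely on one fixed delay
        }
    }
}
{code}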
On the {{shutdownTest}} failures I agree it is unlikely to be related to the
change. Perhaps 100 milliseconds is not long enough for the thread to notice it
is being shut down and exit before the join call gives up. Threads should
request a nano sleep of at most 10ms, but they are not guaranteed to be woken
promptly on a busy CI server and may exceed the test deadline. What do you
think about increasing the timeout from 100 milliseconds to 1 second to see if
it improves things?
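Roughly this kind of change, i.e. just more headroom on a busy host
(illustration only -- the real shutdownTest assertion is written differently):
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Illustration only, not the actual shutdownTest code: allow up to 1 second instead
// of ~100 ms for the executor's threads to notice shutdown and exit.
final class ShutdownHeadroomSketch
{
    static void awaitShutdown(ExecutorService executor) throws InterruptedException
    {
        executor.shutdown();
        if (!executor.awaitTermination(1, TimeUnit.SECONDS))
            throw new AssertionError("executor did not terminate within 1s of shutdown()");
    }
}
{code}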
> Intermittent failure of
> SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race
> condition when shrinking maximum pool size to zero
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-16668
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16668
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Other
> Reporter: Matt Fleming
> Assignee: Matt Fleming
> Priority: Normal
> Fix For: 4.0-rc
>
>
> A difficult-to-hit race condition exists in
> changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool
> size from 0 -> 4 which results in the test failing like so:
> {{junit.framework.AssertionFailedError: Test tasks did not hit max
> concurrency goal expected:<true> but was:<false>
> at org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198)
> at org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}}
> I can hit this issue maybe 2-3 times for every 100 invocations of the unit
> test.
> The issue that causes the failure is that if tasks are still enqueued when
> the maximum pool size is set to zero and if all of the SEPWorker threads
> enter the STOP state before the pool size is bumped to 4, then no SEPWorker
> threads will be spun up to service the task queue. This causes the above
> error.
> Why don't we spin up SEPWorker threads when enqueueing tasks? Because of the
> guard logic in addTask:
> [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121]
> In this scenario taskPermits will not be zero (because we already have tasks
> on the queue), so we never call {{maybeStartSpinningWorker()}} (see the sketch
> after this description).
> A trick to make this issue much easier to hit is to insert a
> {{Thread.sleep(500)}} immediately after setting the pool size to zero. This
> guarantees that all SEPWorker threads will be STOP'd before more work is
> enqueued.
> Here's a fix that attempts to spin up an SEPWorker whenever we grow the
> number of work permits:
> https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01
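To make the quoted description concrete, here is a minimal sketch of the
addTask guard and the shape of the proposed fix, using assumed names and a
plain counter instead of SEPExecutor's packed permits word -- this is not the
code from the linked commit:
{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Assumed names throughout; the real SEPExecutor packs task/work permits into a
// single long and differs in many details. Only the shape of the race/fix is shown.
class SepExecutorSketch
{
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private final AtomicInteger taskPermits = new AtomicInteger(0);
    private final AtomicInteger maxWorkers = new AtomicInteger(0);

    void addTask(Runnable task)
    {
        tasks.add(task);
        // the guard from the description: only the 0 -> 1 transition starts a
        // spinning worker, so a backlog queued while the pool was at zero never
        // wakes anyone on its own
        if (taskPermits.getAndIncrement() == 0)
            maybeStartSpinningWorker();
    }

    void setMaximumPoolSize(int newMax)
    {
        int oldMax = maxWorkers.getAndSet(newMax);
        // shape of the fix: when the pool grows, try to start a worker for any
        // tasks already sitting in the queue
        if (newMax > oldMax && taskPermits.get() > 0)
            maybeStartSpinningWorker();
    }

    private void maybeStartSpinningWorker()
    {
        // see the spinning-guard sketch earlier in this comment
    }
}
{code}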