[jira] [Comment Edited] (CASSANDRA-16181) 4.0 Quality: Replication Test Audit

Caleb Rackliffe (Jira) Mon, 01 Mar 2021 19:58:07 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293329#comment-17293329
 ]


Caleb Rackliffe edited comment on CASSANDRA-16181 at 3/2/21, 3:57 AM:
----------------------------------------------------------------------

Looking through the actual logs for {{shouldStreamHintsDuringDecomission}}, the 
migration coordinator on one of the nodes is trying to submit a schema pull 
on the MIGRATION stage, but it doesn't check whether the stage executor has 
already been shut down, which it may have been as a result of the decommission. 
({{StorageService#decommission()}} shuts down all the stage executors.)

{noformat}
ERROR [node1_isolatedExecutor:1] node1 2021-02-15 19:35:36,284 CassandraDaemon.java:579 - Exception in thread Thread[node1_NonPeriodicTasks:1,5,node1]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:72)
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.execute(DebuggableThreadPoolExecutor.java:176)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
    at org.apache.cassandra.concurrent.Stage.submit(Stage.java:129)
    at org.apache.cassandra.schema.MigrationCoordinator.lambda$scheduleSchemaPull$2(MigrationCoordinator.java:362)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}
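
For illustration only, the missing "is the executor shut down?" check could look 
roughly like the sketch below. It uses a plain JDK {{ExecutorService}} as a 
stand-in for the stage executor; {{GuardedSubmitSketch}} and 
{{submitIfRunning}} are hypothetical names, not Cassandra code. The check alone 
is racy, so the sketch also catches {{RejectedExecutionException}}.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical helper; a plain ExecutorService stands in for the stage executor.
final class GuardedSubmitSketch
{
    /** Returns true if the task was handed to the executor, false if it was skipped. */
    static boolean submitIfRunning(ExecutorService stage, Runnable task)
    {
        // Cheap early exit when shutdown has already been requested.
        if (stage.isShutdown())
            return false;

        try
        {
            stage.submit(task);
            return true;
        }
        catch (RejectedExecutionException e)
        {
            // Shutdown raced with the isShutdown() check above; treat it as skipped.
            return false;
        }
    }
}
```

A guard like this only hides the symptom on one call site, which is why fixing 
the shutdown order is probably the more general answer.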

This looks like a general issue with pool shutdown order during a decommission. 
[~e.dimitrova] [~adelapena] If this makes sense to you, I think the best option 
here would be to create a separate Jira where we can make sure 
{{ScheduledExecutors.nonPeriodicTasks}} is shut down in 
{{StorageService#decommission()}} finalization, just like it is in 
{{StorageService#drain()}}. Assuming, of course, that we do that before we call 
{{Stage.shutdownNow()}}, it shouldn't be possible for a delayed schema pull to 
sneak onto an already-closed MIGRATION stage executor.
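
To make the ordering argument concrete, here is a minimal sketch using plain 
JDK executors rather than the real Cassandra classes. The scheduler stands in 
for {{ScheduledExecutors.nonPeriodicTasks}} and the single-thread pool for the 
MIGRATION stage; {{ShutdownOrderSketch}}, the method name, and the 200 ms delay 
are all made up for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; JDK executors stand in for the Cassandra ones.
final class ShutdownOrderSketch
{
    /** Returns true if the delayed task was rejected by the (already closed) stage. */
    static boolean delayedTaskRejected(boolean stopSchedulerFirst)
    {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(); // ~ nonPeriodicTasks
        ExecutorService stage = Executors.newSingleThreadExecutor();                       // ~ MIGRATION stage
        AtomicBoolean rejected = new AtomicBoolean(false);

        // A delayed task that submits work to the stage, like a scheduled schema pull.
        scheduler.schedule(() -> {
            try
            {
                stage.submit(() -> {});
            }
            catch (RejectedExecutionException e)
            {
                rejected.set(true);
            }
        }, 200, TimeUnit.MILLISECONDS);

        try
        {
            if (stopSchedulerFirst)
            {
                // Stop the scheduler and wait for its pending work, then close the stage.
                scheduler.shutdown();
                scheduler.awaitTermination(5, TimeUnit.SECONDS);
                stage.shutdownNow();
            }
            else
            {
                // Close the stage while a delayed task is still pending on the scheduler.
                stage.shutdownNow();
                scheduler.shutdown();
                scheduler.awaitTermination(5, TimeUnit.SECONDS);
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return rejected.get();
    }
}
```

With the scheduler stopped and awaited first, the delayed task runs against a 
live stage; with the order reversed, it hits the same kind of rejection as in 
the log above.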

It sort of reminds me of CASSANDRA-11062, although I don't think we have to hit 
{{setExecuteExistingDelayedTasksAfterShutdownPolicy()}} on the executor if we 
wait for termination like we do for {{drain()}}.
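
As a reminder of what that policy does at the JDK level, here is a small 
sketch against a bare {{ScheduledThreadPoolExecutor}}. It says nothing about 
how Cassandra configures its own executors; {{DelayedTaskPolicySketch}} and the 
200 ms delay are hypothetical.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo of the JDK policy; not Cassandra code.
final class DelayedTaskPolicySketch
{
    /** Returns true if a task that was still delayed at shutdown() ended up running. */
    static boolean delayedTaskRanAfterShutdown(boolean executeExistingDelayedTasks)
    {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        scheduler.setExecuteExistingDelayedTasksAfterShutdownPolicy(executeExistingDelayedTasks);

        AtomicBoolean ran = new AtomicBoolean(false);
        scheduler.schedule(() -> ran.set(true), 200, TimeUnit.MILLISECONDS);

        // With the policy set to false, shutdown() cancels the pending delayed task.
        // With true (the JDK default), the task still runs and termination waits for it.
        scheduler.shutdown();
        try
        {
            scheduler.awaitTermination(5, TimeUnit.SECONDS);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ran.get();
    }
}
```

Under the default policy, awaiting termination already lets pending delayed 
tasks finish before anything downstream is closed, which is the reason the 
setter should not be needed in that case.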


was (Author: maedhroz):
Looking through the actual logs for {{shouldStreamHintsDuringDecomission}}, the 
migration coordinator from one of the nodes is trying to submit a schema pull 
on the MIGRATION stage, but it doesn't actually check to see if the stage 
executor is shut down, and it might be as a result of the decommission. 
({{StorageService#decommission()}} shuts down all the stage executors.)

{noformat}
ERROR [node1_isolatedExecutor:1] node1 2021-02-15 19:35:36,284 CassandraDaemon.java:579 - Exception in thread Thread[node1_NonPeriodicTasks:1,5,node1]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:72)
    at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.execute(DebuggableThreadPoolExecutor.java:176)
    at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
    at org.apache.cassandra.concurrent.Stage.submit(Stage.java:129)
    at org.apache.cassandra.schema.MigrationCoordinator.lambda$scheduleSchemaPull$2(MigrationCoordinator.java:362)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

This looks like a general issue with pool shutdown order during a decommission. 
[~e.dimitrova] [~adelapena] If this makes sense to you, I think the best option 
here would be to create a separate Jira where we can make sure 
{{ScheduledExecutors.nonPeriodicTasks}} is shut down in 
{{StorageService#decommission()}} finalization, just like it is in 
{{StorageService#drain()}}. Assuming, of course, that we do that before we call 
{{Stage.shutdownNow()}}, it shouldn't be possible for a delayed schema pull to 
sneak onto an already-closed MIGRATION stage executor.

> 4.0 Quality: Replication Test Audit
> -----------------------------------
>
>                 Key: CASSANDRA-16181
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16181
>             Project: Cassandra
>          Issue Type: Task
>          Components: Test/unit
>            Reporter: Andres de la Peña
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.0-rc
>
>          Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> This is a subtask of CASSANDRA-15579 focusing on replication.
> I think that the main reference dtest for this is 
> [replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py].
>  We should identify which other tests cover this and identify what should be 
> extended, similarly to what has been done with CASSANDRA-15977.
> The doc 
> [here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing]
>  describes the existing state of testing around replication.


