[
https://issues.apache.org/jira/browse/CASSANDRA-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293329#comment-17293329
]
Caleb Rackliffe edited comment on CASSANDRA-16181 at 3/2/21, 3:57 AM:
----------------------------------------------------------------------
Looking through the actual logs for {{shouldStreamHintsDuringDecomission}}, the
migration coordinator on one of the nodes tries to submit a schema pull to the
MIGRATION stage without checking whether the stage executor has already been
shut down, which it may well have been as a result of the decommission.
({{StorageService#decommission()}} shuts down all the stage executors.)
{noformat}
ERROR [node1_isolatedExecutor:1] node1 2021-02-15 19:35:36,284 CassandraDaemon.java:579 - Exception in thread Thread[node1_NonPeriodicTasks:1,5,node1]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
	at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:72)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.execute(DebuggableThreadPoolExecutor.java:176)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
	at org.apache.cassandra.concurrent.Stage.submit(Stage.java:129)
	at org.apache.cassandra.schema.MigrationCoordinator.lambda$scheduleSchemaPull$2(MigrationCoordinator.java:362)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}
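For anyone who wants to see the shape of this failure outside of Cassandra, here is a minimal sketch using only plain JDK executors. It is not the Cassandra code: the class name, the {{nonPeriodic}} and {{stage}} variables, and the 200 ms delay are all made up for illustration. A delayed task tries to submit work to an executor that was shut down after the task was scheduled, and the submit is rejected. With plain JDK executors the exception ends up inside the returned future rather than in a log line.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class DelayedSubmitAfterShutdown {

    /** Returns true if the delayed task's submit was rejected by the closed executor. */
    static boolean delayedSubmitIsRejected() throws Exception {
        // Hypothetical stand-ins for the non-periodic task pool and a stage executor.
        ScheduledExecutorService nonPeriodic = Executors.newSingleThreadScheduledExecutor();
        ExecutorService stage = Executors.newSingleThreadExecutor();
        try {
            // Schedule a delayed "schema pull" that hands work to the stage when it fires.
            ScheduledFuture<?> pull = nonPeriodic.schedule(() -> {
                stage.submit(() -> {});
            }, 200, TimeUnit.MILLISECONDS);

            // The stage is shut down while the delayed task is still pending.
            stage.shutdownNow();

            try {
                pull.get(5, TimeUnit.SECONDS);
                return false; // the submit was accepted, which is not expected here
            } catch (ExecutionException e) {
                // The default JDK rejection policy throws on submit after shutdown.
                return e.getCause() instanceof RejectedExecutionException;
            }
        } finally {
            nonPeriodic.shutdownNow();
            stage.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("rejected: " + delayedSubmitIsRejected());
    }
}
```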
This looks like a general issue with pool shutdown order during a decommission.
[~e.dimitrova] [~adelapena] If this makes sense to you, I think the best option
here would be to create a separate Jira where we can make sure
{{ScheduledExecutors.nonPeriodicTasks}} is shut down in
{{StorageService#decommission()}} finalization, just like it is in
{{StorageService#drain()}}. Assuming, of course, that we do that before we call
{{Stage.shutdownNow()}}, it shouldn't be possible for a delayed schema pull to
sneak onto an already-closed MIGRATION stage executor.
It sort of reminds me of CASSANDRA-11062, although I don't think we have to hit
{{setExecuteExistingDelayedTasksAfterShutdownPolicy()}} on the executor if we
wait for termination like we do for {{drain()}}.
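To make the proposed ordering concrete, here is a rough sketch with plain JDK executors. It is not the actual {{StorageService}} code, and the class, variable names, and delay are hypothetical. The scheduled pool is shut down and awaited first, and only then is the stage executor closed. In the JDK, {{ScheduledThreadPoolExecutor}} runs already-scheduled delayed tasks after {{shutdown()}} by default, so with the wait in place the pending task either runs while the stage is still open (the default) or is cancelled (if that policy is turned off). Either way it cannot reach a closed stage.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownOrderSketch {

    /** Returns true if the scheduled pool terminated cleanly and no submit was rejected. */
    static boolean shutsDownWithoutRejection() throws InterruptedException {
        // Hypothetical stand-ins for the non-periodic task pool and a stage executor.
        ScheduledThreadPoolExecutor nonPeriodic = new ScheduledThreadPoolExecutor(1);
        ExecutorService stage = Executors.newSingleThreadExecutor();
        AtomicBoolean rejected = new AtomicBoolean(false);

        // A delayed task that hands work to the stage when it fires.
        nonPeriodic.schedule(() -> {
            try {
                stage.submit(() -> {});
            } catch (RejectedExecutionException e) {
                rejected.set(true);
            }
        }, 200, TimeUnit.MILLISECONDS);

        // Step 1: stop accepting new delayed work, then wait for pending tasks to finish.
        nonPeriodic.shutdown();
        boolean terminated = nonPeriodic.awaitTermination(5, TimeUnit.SECONDS);

        // Step 2: only now close the stage executor.
        stage.shutdownNow();

        return terminated && !rejected.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("clean shutdown: " + shutsDownWithoutRejection());
    }
}
```

Swapping steps 1 and 2 in this sketch would let the delayed task fire against a closed stage and set the {{rejected}} flag.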
> 4.0 Quality: Replication Test Audit
> -----------------------------------
>
> Key: CASSANDRA-16181
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16181
> Project: Cassandra
> Issue Type: Task
> Components: Test/unit
> Reporter: Andres de la Peña
> Assignee: Caleb Rackliffe
> Priority: Normal
> Fix For: 4.0-rc
>
> Time Spent: 11h 20m
> Remaining Estimate: 0h
>
> This is a subtask of CASSANDRA-15579 focusing on replication.
> I think that the main reference dtest for this is
> [replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py].
> We should identify which other tests cover this and identify what should be
> extended, similarly to what has been done with CASSANDRA-15977.
> The doc
> [here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing]
> describes the existing state of testing around replication.