Hi Bowen,

Thanks for the reply. These have been our normal shutdowns: we do a nodetool drain before restarting the service, so I would have thought there should not be any commit logs to replay.
However, there are these messages for one commit log, but it looks like the replay finished quickly and correctly:

INFO [main] 2022-01-19 10:08:22,811 CommitLog.java:173 - Replaying /var/lib/cassandra/commitlog/CommitLog-7-1642094921295.log
WARN [main] 2022-01-19 10:08:22,839 CommitLogReplayer.java:305 - Origin of 2 sstables is unknown or doesn't match the local node; commitLogIntervals for them were ignored
(repeated about 10 times)
WARN [main] 2022-01-19 10:08:22,842 CommitLogReplayer.java:305 - Origin of 3 sstables is unknown or doesn't match the local node; commitLogIntervals for them were ignored
INFO [main] 2022-01-19 10:08:22,853 CommitLogReader.java:256 - Finished reading /var/lib/cassandra/commitlog/CommitLog-7-1642094921295.log
INFO [main] 2022-01-19 10:08:22,882 CommitLog.java:175 - Log replay complete, 0 replayed mutations

Thanks

Paul

> On 19 Jan 2022, at 13:03, Bowen Song <bo...@bso.ng> wrote:
>
> Nothing obvious from the logs you posted.
>
> Generally speaking, replaying the commit log is often the culprit when a node
> takes a long time to start. I have seen many nodes with large memtable and
> commit log size limits spending over half an hour replaying the commit log. I
> usually do a "nodetool flush" before shutting down the node to help speed up
> the start time if the shutdown was planned. There isn't much you can do about
> unexpected shutdowns, such as server crashes. When that happens, the only
> reasonable thing to do is wait for the commit log replay to finish. You
> should see log entries related to replaying commit logs if this is the case.
>
> However, if you don't find any logs related to replaying commit logs, the
> cause may be completely different.
>
>
> On 19/01/2022 11:54, Paul Chandler wrote:
>> Hi all,
>>
>> We have upgraded a couple of clusters from 3.11.6, and now we are having
>> issues when we restart the nodes.
>>
>> The node will either hang or take 10-30 minutes to restart. These are the
>> last messages we have in the system.log:
>>
>> INFO [NonPeriodicTasks:1] 2022-01-19 10:08:23,267 FileUtils.java:545 - Deleting file during startup: /var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-184-big-Summary.db
>> INFO [NonPeriodicTasks:1] 2022-01-19 10:08:23,268 LogTransaction.java:240 - Unfinished transaction log, deleting /var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-185-big-Data.db
>> INFO [NonPeriodicTasks:1] 2022-01-19 10:08:23,268 FileUtils.java:545 - Deleting file during startup: /var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-185-big-Summary.db
>> INFO [NonPeriodicTasks:1] 2022-01-19 10:08:23,269 LogTransaction.java:240 - Unfinished transaction log, deleting /var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-186-big-Data.db
>> INFO [NonPeriodicTasks:1] 2022-01-19 10:08:23,270 FileUtils.java:545 - Deleting file during startup: /var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-186-big-Summary.db
>> INFO [NonPeriodicTasks:1] 2022-01-19 10:08:23,272 LogTransaction.java:240 - Unfinished transaction log, deleting /var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb_txn_unknowncompactiontype_bc501d00-790f-11ec-9f80-858854746758.log
>> INFO [MemtableFlushWriter:2] 2022-01-19 10:08:23,289 LogTransaction.java:240 - Unfinished transaction log, deleting /var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/nb_txn_flush_bc52dc20-790f-11ec-9f80-858854746758.log
>>
>> The debug log has messages from DiskBoundaryManager.java at the same time;
>> after that it only has the following messages:
>>
>> DEBUG [ScheduledTasks:1] 2022-01-19 10:28:09,430 SSLFactory.java:354 - Checking whether certificates have been updated []
>> DEBUG [ScheduledTasks:1] 2022-01-19 10:38:09,431 SSLFactory.java:354 - Checking whether certificates have been updated []
>> DEBUG [ScheduledTasks:1] 2022-01-19 10:48:09,431 SSLFactory.java:354 - Checking whether certificates have been updated []
>> DEBUG [ScheduledTasks:1] 2022-01-19 10:58:09,431 SSLFactory.java:354 - Checking whether certificates have been updated []
>>
>> It seems to get worse after each restart, until it reaches the state
>> where it just hangs; then the only thing to do is to re-bootstrap the node.
>>
>> Once I had re-bootstrapped all the nodes in the cluster, I thought the
>> cluster was stable, but now one of the nodes is hanging again.
>>
>> Does anyone have any ideas what is causing the problems?
>>
>> Thanks
>>
>> Paul Chandler
>>
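Bowen's check (does a slow start line up with commit log replay in system.log?) can be scripted. The following is a minimal Python sketch, not a Cassandra tool: it assumes the log layout of the lines quoted in this thread (`LEVEL [thread] YYYY-MM-DD HH:MM:SS,mmm Source.java:NNN - message`) and keys only on the "Replaying" and "Log replay complete" messages shown above. The function name `replay_durations` is hypothetical.

```python
import re
from datetime import datetime
from typing import Iterable, List, Optional

# Assumed layout, taken only from the log lines quoted in this thread:
# "LEVEL [thread] YYYY-MM-DD HH:MM:SS,mmm Source.java:NNN - message"
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<source>\S+)\s+-\s+(?P<msg>.*)$"
)


def parse_ts(ts: str) -> datetime:
    # Milliseconds follow a comma; %f accepts the 3-digit fraction.
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def replay_durations(lines: Iterable[str]) -> List[float]:
    """Hypothetical helper: seconds from the first 'Replaying' message to the
    next 'Log replay complete' message, one entry per startup found.
    An empty list means no replay entries were logged."""
    durations: List[float] = []
    start: Optional[datetime] = None
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip wrapped lines, stack traces, etc.
        msg = m.group("msg")
        if msg.startswith("Replaying ") and start is None:
            # Several segments may be replayed; only the first marks the start.
            start = parse_ts(m.group("ts"))
        elif msg.startswith("Log replay complete") and start is not None:
            durations.append((parse_ts(m.group("ts")) - start).total_seconds())
            start = None
    return durations
```

Passing an open system.log file object (path depends on your install) returns one duration per startup. For the replay quoted above it would report well under a second, which, following Bowen's reasoning, suggests the 10-30 minute delays have a different cause.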