Avraham Kalvo created CASSANDRA-15152:
-----------------------------------------
Summary: Batch Log - Mutation too large while bootstrapping a newly added node
Key: CASSANDRA-15152
URL: https://issues.apache.org/jira/browse/CASSANDRA-15152
Project: Cassandra
Issue Type: Bug
Components: Consistency/Batch Log
Reporter: Avraham Kalvo
While scaling our six-node cluster out by three more nodes, we encountered a
bootstrap that appears to hang in the `UJ` state (the two nodes added before it
joined within approximately 2.5 hours).
Examining the logs, the following error appeared shortly after the bootstrap
process commenced on this node:
```
ERROR [BatchlogTasks:1] 2019-06-05 14:43:46,508 CassandraDaemon.java:207 - Exception in thread Thread[BatchlogTasks:1,5,main]
java.lang.IllegalArgumentException: Mutation of 108035175 bytes is too large for the maximum size of 16777216
    at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:256) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:520) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.db.Keyspace.applyNotDeferrable(Keyspace.java:399) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:213) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:227) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendSingleReplayMutation(BatchlogManager.java:427) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendReplays(BatchlogManager.java:402) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.replay(BatchlogManager.java:318) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.batchlog.BatchlogManager.processBatchlogEntries(BatchlogManager.java:238) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.batchlog.BatchlogManager.replayFailedBatches(BatchlogManager.java:207) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_201]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_201]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_201]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_201]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_201]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_201]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
```
The same error has been repeating in the logs ever since.
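For a sense of scale, here is a minimal sketch (not part of Cassandra; the
function name `oversized_mutations` is made up for illustration) that pulls the
mutation size and the limit out of log text like the above and compares them.
It assumes the message wording shown in the trace and tolerates the line
wrapping.

```python
import re

# Matches the rejection message from the trace above; \s+ tolerates
# the message being wrapped across two lines.
PATTERN = re.compile(
    r"Mutation of (\d+) bytes is too large\s+for the maximum size of (\d+)"
)


def oversized_mutations(log_text):
    """Return (mutation_bytes, limit_bytes) pairs found in the log text."""
    return [(int(size), int(limit)) for size, limit in PATTERN.findall(log_text)]


# Sample taken verbatim from the exception message in this report.
sample = (
    "java.lang.IllegalArgumentException: Mutation of 108035175 bytes is too large\n"
    "for the maximum size of 16777216"
)

for size, limit in oversized_mutations(sample):
    print(
        f"{size / 2**20:.1f} MiB mutation vs {limit / 2**20:.0f} MiB limit "
        f"({size / limit:.1f}x over)"
    )
```

For the sizes in this report, that is roughly a 103 MiB mutation against a
16 MiB limit.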
We decided to discard the newly added node, which was apparently still joining,
by doing the following:
1. First, we simply restarted it, after which it appeared to start up normally.
2. Then we decommissioned it with `nodetool decommission`. This took a long time
(over 2.5 hours) and was eventually terminated by issuing `nodetool removenode`.
3. The node removal hung on a specific token, which led us to complete it by
force.
4. Forcing the node removal corrupted one of the `system.batches` table's
SSTables, which we backed up and then removed from its underlying data dir as
mitigation (78MB worth).
5. A cluster-wide repair was run.
6. The `Mutation too large` error is now repeating in three different variants
(reported sizes) on three different nodes (our standard replication factor is
three).
We're not sure whether we're hitting
https://issues.apache.org/jira/browse/CASSANDRA-11670, as it's said to be
resolved in our current version, 3.0.10.
We would still like to determine the root cause, as we need to know whether to
expect this in production environments.
How would you recommend identifying which keyspace.table this mutation belongs
to?
Thanks.