[jira] [Created] (CASSANDRA-15152) Batch Log - Mutation too large while bootstrapping a newly added node

Avraham Kalvo (JIRA) Sun, 09 Jun 2019 22:36:30 -0700

Avraham Kalvo created CASSANDRA-15152:
-----------------------------------------


             Summary: Batch Log - Mutation too large while bootstrapping a newly added node
                 Key: CASSANDRA-15152
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15152
             Project: Cassandra
          Issue Type: Bug
          Components: Consistency/Batch Log
            Reporter: Avraham Kalvo


While scaling our six-node cluster by three more nodes, we hit a case in 
which bootstrap of one new node appears hung in the `UJ` (Up/Joining) state 
(the two nodes added before it joined within approximately 2.5 hours).

Examining the logs, we found the following error, which first appeared 
shortly after the bootstrap process commenced on this node:
```
ERROR [BatchlogTasks:1] 2019-06-05 14:43:46,508 CassandraDaemon.java:207 - Exception in thread Thread[BatchlogTasks:1,5,main]
java.lang.IllegalArgumentException: Mutation of 108035175 bytes is too large for the maximum size of 16777216
        at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:256) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:520) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.db.Keyspace.applyNotDeferrable(Keyspace.java:399) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.db.Mutation.apply(Mutation.java:213) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.db.Mutation.apply(Mutation.java:227) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendSingleReplayMutation(BatchlogManager.java:427) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.sendReplays(BatchlogManager.java:402) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager$ReplayingBatch.replay(BatchlogManager.java:318) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager.processBatchlogEntries(BatchlogManager.java:238) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.batchlog.BatchlogManager.replayFailedBatches(BatchlogManager.java:207) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) ~[apache-cassandra-3.0.10.jar:3.0.10]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_201]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_201]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_201]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_201]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_201]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_201]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
```
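
To compare how far each rejected mutation exceeds the limit, the sizes can 
be pulled out of the log text. The following is a minimal sketch, not part of 
the original report: the function name `oversized_mutations` is hypothetical, 
and the pattern assumes the message wording shown in the trace above.

```python
import re

# Message wording taken from the trace above. In a real system.log the message
# sits on a single line; in this email it was wrapped across two.
PATTERN = re.compile(
    r"Mutation of (\d+) bytes is too large for the maximum size of (\d+)"
)


def oversized_mutations(log_text):
    """Return (mutation_bytes, limit_bytes, ratio) for each matching message."""
    # Collapse all whitespace so a message split by line wrapping still matches.
    flat = " ".join(log_text.split())
    results = []
    for size, limit in PATTERN.findall(flat):
        size, limit = int(size), int(limit)
        results.append((size, limit, size / limit))
    return results


sample = """java.lang.IllegalArgumentException: Mutation of 108035175 bytes is too large
for the maximum size of 16777216"""

for size, limit, ratio in oversized_mutations(sample):
    print(f"{size} bytes vs limit {limit}: {ratio:.2f}x the limit")
```

The reported limit of 16777216 bytes is 16 MiB, which would match half of a 
32 MiB commit log segment if this cluster runs with the default 
`commitlog_segment_size_in_mb`. That is an assumption about the 
configuration, not something stated in the report.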

Since then, the error has kept repeating in the logs.

We decided to discard the newly added node, which apparently was still 
joining, as follows:
1. First, we simply restarted it, after which it apparently started up 
normally.
2. Then we decommissioned it with `nodetool decommission`. This ran for a long 
time (over 2.5 hours), and we eventually terminated it by issuing `nodetool 
removenode`.
3. The node removal hung on a specific token, so we completed it by force.
4. Forcing the node removal corrupted one of the SSTables of the 
`system.batches` table. As mitigation, we moved that SSTable (78MB) out of 
its data directory, keeping a backup.
5. We ran a cluster-wide repair.
6. The `Mutation too large` error now repeats in three variants (different 
reported sizes) on three different nodes (our standard replication factor is 
three).

We're not sure whether we're hitting 
https://issues.apache.org/jira/browse/CASSANDRA-11670, since it is said to be 
resolved in our current version, 3.0.10.
We would still like to find the root cause, as we need to establish whether 
to expect this in production environments.

How would you recommend determining which keyspace.table this mutation 
belongs to?

Thanks.



