Re: Node crashes on repair (Cassandra 3.11.1)
I think we’ve hit the bug described here: https://issues.apache.org/jira/browse/CASSANDRA-14096

Regards,
Christian

From: Christian Lorenz <christian.lor...@webtrekk.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Friday, 1 December 2017 at 10:04
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Node crashes on repair (Cassandra 3.11.1)

Hi Jeff,

the repairs worked fine before on version 3.9. I noticed that the validation tasks when doing a repair are not bound anymore to the concurrent_compactors value. Is this maybe too much pressure for the node to manage, so it gets stressed too much?

Greetings,
Christian

From: Jeff Jirsa <jji...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, 30 November 2017 at 19:46
To: cassandra <user@cassandra.apache.org>
Subject: Re: Node crashes on repair (Cassandra 3.11.1)

That was worded poorly. The tree has a maximum depth of 20, so it is the same size for any range > 2**20.

On Thu, Nov 30, 2017 at 10:43 AM, Jeff Jirsa <jji...@gmail.com> wrote:

Merkle trees have a fixed size/depth (2**20), so it’s not that, but it could be timing out elsewhere (or still running validation or something)

--
Jeff Jirsa

On Nov 30, 2017, at 10:12 AM, Javier Canillas <javier.canil...@gmail.com> wrote:

Christian,

I'm not an expert, but maybe the Merkle tree is too big to transfer between nodes and that's why it times out. How many nodes do you have and what's the size of the keyspace? Have you ever run a successful repair before?

Cassandra Reaper repairs by token range (or even part of one), which is why it can get by with a small Merkle tree.

Regards,
Javier.

2017-11-30 6:48 GMT-03:00 Christian Lorenz <christian.lor...@webtrekk.com>:

Hello,

after updating our cluster to Cassandra 3.11.1 (previously 3.9), running 'nodetool repair --full' leads to the node crashing.

The log file showed the following exception:

ERROR [ReadRepairStage:36] 2017-11-30 07:42:06,439 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:36,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:199) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:175) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:92) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:76) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]

The node's data size is ~270 GB. A repair with Cassandra Reaper works fine though.

Any idea why this could be happening?

Regards,
Christian
Re: Node crashes on repair (Cassandra 3.11.1)
Hi Jeff,

the repairs worked fine before on version 3.9. I noticed that the validation tasks when doing a repair are not bound anymore to the concurrent_compactors value. Is this maybe too much pressure for the node to manage, so it gets stressed too much?

Greetings,
Christian
Re: Node crashes on repair (Cassandra 3.11.1)
That was worded poorly. The tree has a maximum depth of 20, so it is the same size for any range > 2**20.
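As a rough illustration of the depth cap described above, the following Python sketch shows why the leaf count stops growing once a range holds more than 2**20 partitions. The helper names, and the rule of picking the smallest depth that gives one leaf per partition, are assumptions made for illustration; this is not Cassandra's actual implementation.

```python
MAX_DEPTH = 20  # depth cap stated in the thread


def tree_depth(estimated_partitions: int) -> int:
    """Hypothetical helper: smallest depth giving one leaf per partition, capped at MAX_DEPTH."""
    if estimated_partitions <= 1:
        return 0
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return min((estimated_partitions - 1).bit_length(), MAX_DEPTH)


def leaf_count(estimated_partitions: int) -> int:
    """Number of leaves in a full binary tree of the chosen depth."""
    return 2 ** tree_depth(estimated_partitions)


def partitions_per_leaf(estimated_partitions: int) -> float:
    """Average number of partitions hashed into each leaf."""
    return estimated_partitions / leaf_count(estimated_partitions)


if __name__ == "__main__":
    for n in (1_000, 2_000_000, 1_000_000_000):
        print(n, leaf_count(n), partitions_per_leaf(n))
```

Under these assumptions, a range of one billion partitions gets the same 2**20 leaves as a range of two million, so each leaf simply covers more partitions while the tree itself stays the same size.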
Re: Node crashes on repair (Cassandra 3.11.1)
Merkle trees have a fixed size/depth (2**20), so it’s not that, but it could be timing out elsewhere (or still running validation or something)

--
Jeff Jirsa
Re: Node crashes on repair (Cassandra 3.11.1)
Christian,

I'm not an expert, but maybe the Merkle tree is too big to transfer between nodes and that's why it times out. How many nodes do you have and what's the size of the keyspace? Have you ever run a successful repair before?

Cassandra Reaper repairs by token range (or even part of one), which is why it can get by with a small Merkle tree.

Regards,
Javier.
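The subrange approach mentioned above can be sketched in Python as follows. This is a simplified illustration, not Reaper's code: `split_range` is a hypothetical helper, the token bounds of -2**63 and 2**63 - 1 assume the Murmur3Partitioner, and wrap-around ranges are not handled.

```python
from typing import List, Tuple

# Assumed Murmur3Partitioner token bounds.
MIN_TOKEN = -2 ** 63
MAX_TOKEN = 2 ** 63 - 1


def split_range(start: int, end: int, segments: int) -> List[Tuple[int, int]]:
    """Split the token range (start, end] into roughly equal contiguous subranges."""
    if segments < 1:
        raise ValueError("segments must be at least 1")
    if end <= start:
        raise ValueError("wrap-around or empty ranges are not handled in this sketch")
    width = end - start
    # Integer arithmetic keeps the boundaries exact for 64-bit tokens.
    bounds = [start + (width * i) // segments for i in range(segments + 1)]
    # Drop empty subranges, which can appear when segments exceeds the range width.
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


if __name__ == "__main__":
    for lo, hi in split_range(MIN_TOKEN, MAX_TOKEN, 4):
        print(lo, hi)
```

Each resulting pair could then be repaired on its own, for example with the `-st` and `-et` options of `nodetool repair`. The segment count of 4 is arbitrary and only keeps the output short.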
Node crashes on repair (Cassandra 3.11.1)
Hello,

after updating our cluster to Cassandra 3.11.1 (previously 3.9), running 'nodetool repair --full' leads to the node crashing.

The log file showed the following exception:

ERROR [ReadRepairStage:36] 2017-11-30 07:42:06,439 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:36,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:199) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:175) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:92) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:76) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_151]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_151]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.1.jar:3.11.1]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_151]

The node's data size is ~270 GB. A repair with Cassandra Reaper works fine though.

Any idea why this could be happening?

Regards,
Christian