[jira] [Updated] (CASSANDRA-14444) Got NPE when querying Cassandra 3.11.2
[ https://issues.apache.org/jira/browse/CASSANDRA-14444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kurt Greaves updated CASSANDRA-14444:
    Description:
We just upgraded our Cassandra cluster from 2.2.6 to 3.11.2.
After upgrading, we immediately got exceptions in Cassandra like this one:

{code}
ERROR [Native-Transport-Requests-1] 2018-05-11 17:10:21,994 QueryMessage.java:129 - Unexpected error during query
java.lang.NullPointerException: null
    at org.apache.cassandra.dht.RandomPartitioner.getToken(RandomPartitioner.java:248) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.dht.RandomPartitioner.decorateKey(RandomPartitioner.java:92) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.config.CFMetaData.decorateKey(CFMetaData.java:666) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.service.pager.PartitionRangeQueryPager.<init>(PartitionRangeQueryPager.java:44) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.db.PartitionRangeReadCommand.getPager(PartitionRangeReadCommand.java:268) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.cql3.statements.SelectStatement.getPager(SelectStatement.java:475) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:288) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:118) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:224) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:255) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:240) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:116) ~[apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:517) [apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:410) [apache-cassandra-3.11.2.jar:3.11.2]
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:348) [netty-all-4.0.44.Final.jar:4.0.44.Final]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_171]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) [apache-cassandra-3.11.2.jar:3.11.2]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.11.2.jar:3.11.2]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171]
{code}

The table schema is like:

{code}
CREATE TABLE example.example_table (
    id bigint,
    hash text,
    json text,
    PRIMARY KEY (id, hash)
) WITH COMPACT STORAGE
{code}

The query is something like:

{code}
"select * from example.example_table;" // (We do know this is bad practise, and we are trying to fix that right now)
{code}

with fetch-size as 200, using DataStax Java driver.
This table contains about 20k rows.

Actually, the fix is quite simple,

{code}
--- a/src/java/org/apache/cassandra/service/pager/PagingState.java
+++ b/src/java/org/apache/cassandra/service/pager/PagingState.java
@@ -46,7 +46,7 @@ public class PagingState
     public PagingState(ByteBuffer partitionKey, RowMark rowMark, int remaining, int remainingInPartition)
     {
-        this.partitionKey = partitionKey;
+        this.partitionKey = partitionKey == null ? ByteBufferUtil.EMPTY_BYTE_BUFFER : partitionKey;
         this.rowMark = rowMark;
         this.remaining = remaining;
         this.remainingInPartition = remainingInPartition;
{code}

"partitionKey == null ? ByteBufferUtil.EMPTY_BYTE_BUFFER : partitionKey;" is in 2.2.6 and 2.2.8. But it was removed for some reason.

The interesting part is that, we have:

{code}
public final ByteBuffer partitionKey; // Can be null for single partition queries.
{code}

It seems "partitionKey" could be null.

Thanks a lot.
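The null-coalescing guard proposed in the patch above can be illustrated in isolation. This is only a minimal sketch: `PagingKeySketch`, its `EMPTY` constant (standing in for `ByteBufferUtil.EMPTY_BYTE_BUFFER`), and `keyLength()` are hypothetical names made up for the illustration, not Cassandra classes or methods.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the constructor discussed above; not Cassandra's PagingState.
final class PagingKeySketch {
    // Assumed equivalent of ByteBufferUtil.EMPTY_BYTE_BUFFER: a zero-length buffer.
    static final ByteBuffer EMPTY = ByteBuffer.allocate(0);

    final ByteBuffer partitionKey;

    PagingKeySketch(ByteBuffer partitionKey) {
        // Same defensive default as the proposed patch: never store a null key.
        this.partitionKey = partitionKey == null ? EMPTY : partitionKey;
    }

    // A consumer that would throw NullPointerException if the key were stored as null.
    int keyLength() {
        return partitionKey.remaining();
    }
}
```

With the guard in place, `new PagingKeySketch(null).keyLength()` returns 0 rather than throwing, which is the behaviour the reporter says 2.2.6 had.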
[jira] [Updated] (CASSANDRA-14460) ERROR : java.lang.AssertionError: null
[ https://issues.apache.org/jira/browse/CASSANDRA-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kurt Greaves updated CASSANDRA-14460:
    Description:
When I tried to ADD column to a existing table, I am getting below error.

{code}
WARN [MutationStage-48] 2018-02-15 09:42:27,696 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[MutationStage-48,5,main]: {}
java.lang.AssertionError: null
    at io.netty.util.Recycler$WeakOrderQueue.<init>(Recycler.java:225) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.util.Recycler$DefaultHandle.recycle(Recycler.java:180) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.util.Recycler.recycle(Recycler.java:141) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at org.apache.cassandra.utils.btree.BTree$Builder.recycle(BTree.java:839) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.utils.btree.BTree$Builder.build(BTree.java:1092) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.build(PartitionUpdate.java:587) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.maybeBuild(PartitionUpdate.java:577) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.holder(PartitionUpdate.java:388) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.AbstractBTreePartition.unfilteredIterator(AbstractBTreePartition.java:177) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.AbstractBTreePartition.unfilteredIterator(AbstractBTreePartition.java:172) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.serialize(PartitionUpdate.java:779) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation$MutationSerializer.serialize(Mutation.java:393) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:249) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:585) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:462) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:227) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:232) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:241) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.StorageProxy$8.runMayThrow(StorageProxy.java:1416) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2640) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
{code}

How to fix this issue? Why does this issue popped up?
Any pointers / work around solution is appreciated!
[jira] [Updated] (CASSANDRA-14460) ERROR : java.lang.AssertionError: null
[ https://issues.apache.org/jira/browse/CASSANDRA-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mutharasan Anbarasan updated CASSANDRA-14460:
    Description:
When I tried to ADD column to a existing table, I am getting below error.

WARN [MutationStage-48] 2018-02-15 09:42:27,696 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[MutationStage-48,5,main]: {}
java.lang.AssertionError: null
    at io.netty.util.Recycler$WeakOrderQueue.<init>(Recycler.java:225) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.util.Recycler$DefaultHandle.recycle(Recycler.java:180) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.util.Recycler.recycle(Recycler.java:141) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at org.apache.cassandra.utils.btree.BTree$Builder.recycle(BTree.java:839) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.utils.btree.BTree$Builder.build(BTree.java:1092) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.build(PartitionUpdate.java:587) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.maybeBuild(PartitionUpdate.java:577) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.holder(PartitionUpdate.java:388) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.AbstractBTreePartition.unfilteredIterator(AbstractBTreePartition.java:177) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.AbstractBTreePartition.unfilteredIterator(AbstractBTreePartition.java:172) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.serialize(PartitionUpdate.java:779) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation$MutationSerializer.serialize(Mutation.java:393) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:249) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:585) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:462) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:227) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:232) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:241) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.StorageProxy$8.runMayThrow(StorageProxy.java:1416) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2640) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]

How to fix this issue? Why does this issue popped up?
Any pointers / work around solution is appreciated!
[jira] [Updated] (CASSANDRA-14460) ERROR : java.lang.AssertionError: null
[ https://issues.apache.org/jira/browse/CASSANDRA-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mutharasan Anbarasan updated CASSANDRA-14460:
    Description:
When I tried to ADD column to a exitsing table, I am getting below error.

WARN [MutationStage-48] 2018-02-15 09:42:27,696 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[MutationStage-48,5,main]: {}
java.lang.AssertionError: null
    at io.netty.util.Recycler$WeakOrderQueue.<init>(Recycler.java:225) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.util.Recycler$DefaultHandle.recycle(Recycler.java:180) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.util.Recycler.recycle(Recycler.java:141) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at org.apache.cassandra.utils.btree.BTree$Builder.recycle(BTree.java:839) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.utils.btree.BTree$Builder.build(BTree.java:1092) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.build(PartitionUpdate.java:587) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.maybeBuild(PartitionUpdate.java:577) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.holder(PartitionUpdate.java:388) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.AbstractBTreePartition.unfilteredIterator(AbstractBTreePartition.java:177) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.AbstractBTreePartition.unfilteredIterator(AbstractBTreePartition.java:172) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.serialize(PartitionUpdate.java:779) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation$MutationSerializer.serialize(Mutation.java:393) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:249) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:585) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:462) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:227) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:232) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:241) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.StorageProxy$8.runMayThrow(StorageProxy.java:1416) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2640) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]

How to fix this issue? Why does this issue popped up?
Any pointers / work around solution is appreciated!
[jira] [Created] (CASSANDRA-14460) ERROR : java.lang.AssertionError: null
Mutharasan Anbarasan created CASSANDRA-14460:

Summary: ERROR : java.lang.AssertionError: null
Key: CASSANDRA-14460
URL: https://issues.apache.org/jira/browse/CASSANDRA-14460
Project: Cassandra
Issue Type: Bug
Components: CQL
Reporter: Mutharasan Anbarasan
Fix For: 3.10

When I tried to ADD column to a exiting table, I am getting below error.

WARN [MutationStage-48] 2018-02-15 09:42:27,696 AbstractLocalAwareExecutorService.java:167 - Uncaught exception on thread Thread[MutationStage-48,5,main]: {}
java.lang.AssertionError: null
    at io.netty.util.Recycler$WeakOrderQueue.<init>(Recycler.java:225) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.util.Recycler$DefaultHandle.recycle(Recycler.java:180) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at io.netty.util.Recycler.recycle(Recycler.java:141) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
    at org.apache.cassandra.utils.btree.BTree$Builder.recycle(BTree.java:839) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.utils.btree.BTree$Builder.build(BTree.java:1092) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.build(PartitionUpdate.java:587) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.maybeBuild(PartitionUpdate.java:577) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate.holder(PartitionUpdate.java:388) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.AbstractBTreePartition.unfilteredIterator(AbstractBTreePartition.java:177) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.AbstractBTreePartition.unfilteredIterator(AbstractBTreePartition.java:172) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.serialize(PartitionUpdate.java:779) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation$MutationSerializer.serialize(Mutation.java:393) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:249) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:585) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:462) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:227) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:232) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.db.Mutation.apply(Mutation.java:241) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.StorageProxy$8.runMayThrow(StorageProxy.java:1416) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2640) ~[apache-cassandra-3.10.jar:3.10]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [apache-cassandra-3.10.jar:3.10]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) [apache-cassandra-3.10.jar:3.10]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]

How to fix this issue? Why does this issue popped up?
Any pointers / work around solution is appreciated!

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14298) cqlshlib tests broken on b.a.o
[ https://issues.apache.org/jira/browse/CASSANDRA-14298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483150#comment-16483150 ] Patrick Bannister commented on CASSANDRA-14298: --- I think I need to retract my recommendation to use LC_CTYPE=C.UTF-8. I learned this weekend that the C.UTF-8 locale is somewhat specific to Debian. (It's also available on more recent versions of Fedora, as an optional add-on.) I recommended it initially because it's more internationalization friendly than picking a single language such as en_US.UTF-8. Unfortunately, since it's specific to the Debian family, I think that makes it a poor choice for testing. For the lack of a better solution, I recommend we use LC_CTYPE=en_US.UTF-8. Also - I'm working on standing up a RHEL 7.5 instance on AWS to test my work on a different environment, to make sure there aren't more hidden environmental dependencies like this. Separately, as an update on the cqlshlib porting work: my forks of cassandra and cassandra-dtest have cqlshlib3 branches with cqlshlib ported to straight Python 3, with all cqlshlib unittests and all dtest cqlsh_tests passing, except for test_describe (in test_cqlsh_output.py in the cqlshlib unit tests) and test_unusual_dates (in cqlsh_tests.py in the dtests). I still want to try to measure coverage (not sure how that's going to work with the dtests but it should be doable with the unittests), and I definitely want to test these on RHEL or some other non-Debian environment; I'll continue with that work this week. 
> cqlshlib tests broken on b.a.o
> --
>
> Key: CASSANDRA-14298
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14298
> Project: Cassandra
> Issue Type: Bug
> Components: Build, Testing
> Reporter: Stefan Podkowinski
> Assignee: Patrick Bannister
> Priority: Major
> Labels: cqlsh, dtest
> Attachments: CASSANDRA-14298-old.txt, CASSANDRA-14298.txt, cqlsh_tests_notes.md
>
> It appears that cqlsh-tests on builds.apache.org on all branches stopped working since we removed nosetests from the system environment. See e.g. [here|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-cqlsh-tests/458/cython=no,jdk=JDK%201.8%20(latest),label=cassandra/console].
> Looks like we either have to make nosetests available again or migrate to pytest as we did with dtests. Giving pytest a quick try resulted in many errors locally, but I haven't inspected them in detail yet.
[jira] [Assigned] (CASSANDRA-14459) DynamicEndpointSnitch should never prefer latent nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-14459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinay Chella reassigned CASSANDRA-14459:

    Assignee: Joseph Lynch

> DynamicEndpointSnitch should never prefer latent nodes
> --
>
> Key: CASSANDRA-14459
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14459
> Project: Cassandra
> Issue Type: Improvement
> Components: Coordination
> Reporter: Joseph Lynch
> Assignee: Joseph Lynch
> Priority: Minor
>
> The DynamicEndpointSnitch has two unfortunate behaviors that allow it to provide latent hosts as replicas:
> # Loses all latency information when Cassandra restarts
> # Clears latency information entirely every ten minutes (by default), allowing global queries to be routed to _other datacenters_ (and local queries cross racks/azs)
>
> This means that the first few queries after restart/reset could be quite slow compared to average latencies. I propose we solve this by resetting to the minimum observed latency instead of completely clearing the samples and extending the {{isLatencyForSnitch}} idea to a three state variable instead of two, in particular {{YES}}, {{NO}}, {{MAYBE}}. This extension allows {{EchoMessages}} and {{PingMessages}} to send {{MAYBE}} indicating that the DS should use those measurements if it only has one or fewer samples for a host. This fixes both problems because on process restart we send out {{PingMessages}} / {{EchoMessages}} as part of startup, and we would reset to effectively the RTT of the hosts (also at that point normal gossip {{EchoMessages}} have an opportunity to add an additional latency measurement).
>
> This strategy also nicely deals with the "a host got slow but now it's fine" problem that the DS resets were (afaik) designed to stop because the {{EchoMessage}} ping latency will count only after the reset for that host. Ping latency is a more reasonable lower bound on host latency (as opposed to status quo of zero).
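The proposal quoted above (a three-state flag plus a reset that keeps the minimum observed latency) could look roughly like the following. This is a sketch under assumptions, not the DynamicEndpointSnitch implementation: apart from the `YES`, `NO`, and `MAYBE` states named in the issue, every identifier here (`SnitchLatencySketch`, `LatencyUsage`, `receive`, `reset`) is hypothetical, and the per-host sample store is reduced to a single list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical per-host latency tracker; illustrative only, not Cassandra code.
final class SnitchLatencySketch {
    // Three-state replacement for a boolean "is this latency for the snitch" flag.
    enum LatencyUsage { YES, NO, MAYBE }

    private final List<Long> samplesMicros = new ArrayList<>();

    // Record a measurement according to the three-state flag.
    void receive(long latencyMicros, LatencyUsage usage) {
        if (usage == LatencyUsage.NO) {
            return; // never counted
        }
        // MAYBE (e.g. ping/echo traffic) counts only with one or fewer existing samples.
        if (usage == LatencyUsage.MAYBE && samplesMicros.size() > 1) {
            return;
        }
        samplesMicros.add(latencyMicros);
    }

    // Periodic reset: keep the minimum observed latency instead of clearing everything.
    void reset() {
        if (samplesMicros.isEmpty()) {
            return;
        }
        long min = Collections.min(samplesMicros);
        samplesMicros.clear();
        samplesMicros.add(min);
    }

    int sampleCount() {
        return samplesMicros.size();
    }

    // Lowest recorded latency; throws NoSuchElementException if nothing was recorded.
    long minLatency() {
        return Collections.min(samplesMicros);
    }
}
```

After a `reset()`, exactly one sample (the minimum) remains, so the next `MAYBE` measurement is accepted again. That matches the stated intent that ping latency should count right after a restart or reset.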
[jira] [Comment Edited] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482919#comment-16482919 ]

Joseph Lynch edited comment on CASSANDRA-14358 at 5/21/18 7:27 PM:
---

[~alienth] that is interesting and thank you for digging so deeply! If I understand correctly, during a {{drain}} the other servers are responsible for noticing the change and closing their connections within the {{[shutdown_announce_in_ms|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L1497]}} period in [response|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/GossipShutdownVerbHandler.java#L37] to the {{GOSSIP_SHUTDOWN}} gossip state, and then the {{[markAsShutdown|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L363-L373]}} method marks it down and forcibly convicts it.

I believe that the TCP connections get closed via the {{StorageService}}'s {{onDead}} method which calls {{[onDead|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/StorageService.java#L2514]}} which calls {{[MessagingService::reset|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/MessagingService.java#L505]}} which calls {{[OutboundTcpConnection::closeSocket|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L80]}}, which [enqueues a sentinel|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L210] into the backlog, and then the {{[OutboundTcpConnection::run|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L253]}} method is actually supposed to close it. The {{drainedMessages}} queue is a local reference though, so the backlog could get something that was enqueued before the {{CLOSE_SENTINEL}} and after it as well. This seems very racy to me; in particular the reconnection logic might race with the closing logic from what I can tell, as we have a 2 second window between when the clients start closing and when the server will actually stop accepting new connections (because it closes the listeners). Non-stateful networks would surface the RST in the {{writeConnected}} method, but AWS is like "yea that machine isn't allowed to talk to that one" and just blackholes the RSTs... I wonder if I can reproduce this by increasing that window significantly and just sending lots of traffic.
[jira] [Commented] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482919#comment-16482919 ] Joseph Lynch commented on CASSANDRA-14358: -- [~alienth] that is interesting and thank you for digging so deeply! If I understand correctly during a {{drain}} the other servers are responsible for noticing the change and closing their connections within the {{[shutdown_announce_in_ms|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L1497]}} period in [response|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/GossipShutdownVerbHandler.java#L37] to the {{GOSSIP_SHUTDOWN}} gossip state, and then the {{[markAsShutdown|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/gms/Gossiper.java#L363-L373]}} method marks it down and forcibly convicts it. 
I believe that the TCP connections get closed via the {{StorageService}}'s {{[onDead|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/service/StorageService.java#L2514]}} method, which calls {{[MessagingService::reset|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/MessagingService.java#L505]}}, which calls {{[OutboundTcpConnection::closeSocket|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L80]}}, which [enqueues a sentinel|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L210] into the backlog, and then the {{[OutboundTcpConnection::run|https://github.com/apache/cassandra/blob/06b3521acdb21dd3d85902d59146b9d08ad7d752/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L253]}} method is actually supposed to close it. The {{drainedMessages}} queue is a local reference though, so backlog could get something that was enqueued before the {{CLOSE_SENTINEL}} and after it as well. This seems very racy to me. In particular, from what I can tell the reconnection logic might race with the closing logic, as we have a 2 second window between when the clients start closing and when the server will actually stop accepting new connections (because it closes the listeners). Non-stateful networks would surface the RST in the {{writeConnected}} method, but AWS is like "yea that machine isn't allowed to talk to that one" and just blackholes the RSTs... 
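The suspected race in the comment above can be illustrated with a small model. This is only a sketch under stated assumptions: {{SentinelDrainSketch}} and {{processOneBatch}} are hypothetical names, strings stand in for queued messages, and the loop is a simplification of a drain-then-process worker, not the actual {{OutboundTcpConnection::run}} code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified, hypothetical model of a drain-then-process loop; not the Cassandra code.
class SentinelDrainSketch {
    static final String CLOSE_SENTINEL = "CLOSE";

    /** Drains the backlog into a local list and replays it, recording socket events. */
    static List<String> processOneBatch(BlockingQueue<String> backlog) {
        List<String> events = new ArrayList<>();
        List<String> drained = new ArrayList<>();   // local snapshot, playing the role of drainedMessages
        backlog.drainTo(drained);
        boolean connected = true;
        for (String message : drained) {
            if (message.equals(CLOSE_SENTINEL)) {
                connected = false;                  // the sentinel closes the socket
                events.add("close");
                continue;
            }
            if (!connected) {                       // a message queued behind the sentinel forces a reconnect
                connected = true;
                events.add("reconnect");
            }
            events.add("write:" + message);
        }
        return events;
    }

    public static void main(String[] args) {
        BlockingQueue<String> backlog = new LinkedBlockingQueue<>();
        backlog.add("m1");
        backlog.add(CLOSE_SENTINEL);
        backlog.add("m2");                          // enqueued after the sentinel, same batch
        System.out.println(processOneBatch(backlog));
    }
}
```

In this model a message that lands behind the sentinel in the same drained batch reopens the connection right after it was closed, which is the kind of close/reconnect interleaving the comment is worried about.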
> OutboundTcpConnection can hang for many minutes when nodes restart > -- > > Key: CASSANDRA-14358 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14358 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running > with {{internode_encryption: all}} and the EC2 multi region snitch on Linux > 4.13 within the same AWS region. Smallest cluster I've seen the problem on is > 12 nodes, reproduces more reliably on 40+ and 300 node clusters consistently > reproduce on at least one node in the cluster. > So all the connections are SSL and we're connecting on the internal ip > addresses (not the public endpoint ones). > Potentially relevant sysctls: > {noformat} > /proc/sys/net/ipv4/tcp_syn_retries = 2 > /proc/sys/net/ipv4/tcp_synack_retries = 5 > /proc/sys/net/ipv4/tcp_keepalive_time = 7200 > /proc/sys/net/ipv4/tcp_keepalive_probes = 9 > /proc/sys/net/ipv4/tcp_keepalive_intvl = 75 > /proc/sys/net/ipv4/tcp_retries2 = 15 > {noformat} >Reporter: Joseph Lynch >Assignee: Joseph Lynch >Priority: Major > Attachments: 10 Minute Partition.pdf > > > edit summary: This primarily impacts networks with stateful firewalls such as > AWS. I'm working on a proper patch for trunk but unfortunately it relies on > the Netty refactor in 4.0 so it will be hard to backport to previous > versions. A workaround for earlier versions is to set the > {{net.ipv4.tcp_retries2}} sysctl to ~5. This can be done with the following: > {code:java} > $ cat /etc/sysctl.d/20-cassandra-tuning.conf > net.ipv4.tcp_retries2=5 > $ # Reload all sysctls > $ sysctl --system{code} > Original Bug Report: > I've been trying to debug nodes not being able to see each other during > longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can > contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and > 2.1.x clusters for us. I think I finally have a lead. 
It appears that prior > to trunk (with the awesome Netty refactor) we do not set socket
[jira] [Updated] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Lynch updated CASSANDRA-14358: - Description: edit summary: This primarily impacts networks with stateful firewalls such as AWS. I'm working on a proper patch for trunk but unfortunately it relies on the Netty refactor in 4.0 so it will be hard to backport to previous versions. A workaround for earlier versions is to set the {{net.ipv4.tcp_retries2}} sysctl to ~5. This can be done with the following: {code:java} $ cat /etc/sysctl.d/20-cassandra-tuning.conf net.ipv4.tcp_retries2=5 $ # Reload all sysctls $ sysctl --system{code} Original Bug Report: I've been trying to debug nodes not being able to see each other during longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 2.1.x clusters for us. I think I finally have a lead. It appears that prior to trunk (with the awesome Netty refactor) we do not set socket connect timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe that this means that we could potentially block forever on {{connect}} or {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I think that the OS will protect us somewhat (and that may be what's causing the eventual timeout) but I think that given the right network conditions our {{OutboundTcpConnection}} threads can just be stuck never making any progress until the OS intervenes. I have attached some logs of such a network partition during a rolling restart where an old node in the cluster has a completely foobarred {{OutboundTcpConnection}} for ~10 minutes before finally getting a {{java.net.SocketException: Connection timed out (Write failed)}} and immediately successfully reconnecting. 
I conclude that the old node is the problem because the new node (the one that restarted) is sending ECHOs to the old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new node's ECHOs, but the new node is never getting the ECHO's. This appears, to me, to indicate that the old node's {{OutboundTcpConnection}} thread is just stuck and can't make any forward progress. By the time we could notice this and slap TRACE logging on, the only thing we see is ~10 minutes later a {{SocketException}} inside {{writeConnected}}'s flush and an immediate recovery. It is interesting to me that the exception happens in {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write failure}} I believe that this can't be a connection reset), because my understanding is that we should have a fully handshaked SSL connection at that point in the code. Current theory: # "New" node restarts, "Old" node calls [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433] # Old node starts [creating a new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141] SSL socket # SSLSocket calls [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98], which conveniently calls connect with a default timeout of "forever". We could hang here forever until the OS kills us. # If we continue, we get to [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L263] which eventually calls [flush|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L341] on the output stream and also can hang forever. 
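Step 3 of the theory above turns on {{connect}} being called with no timeout. As a minimal sketch of what bounded calls look like with plain JDK sockets: {{BoundedSocketSketch}} and {{open}} are hypothetical names, and the 2 second and 2 minute values are the ones floated in this ticket rather than tested recommendations. This is not the eventual patch.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper, not Cassandra code: bounds connect and blocking reads on a plain socket.
class BoundedSocketSketch {
    /** Opens a TCP socket with a bounded connect and a bounded blocking read. */
    static Socket open(InetSocketAddress peer, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        Socket socket = new Socket();               // unconnected, so connect() can take a timeout
        try {
            socket.connect(peer, connectTimeoutMs); // fails with SocketTimeoutException instead of waiting indefinitely
            socket.setSoTimeout(readTimeoutMs);     // SO_TIMEOUT: bounds blocking reads only, not writes
            return socket;
        } catch (IOException e) {
            socket.close();                         // do not leak the descriptor on failure
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Loopback listener so the example runs without any external peer.
        try (ServerSocket listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
             Socket s = open(new InetSocketAddress(listener.getInetAddress(), listener.getLocalPort()),
                             2_000, 120_000)) {
            System.out.println("SO_TIMEOUT ms: " + s.getSoTimeout());
        }
    }
}
```

For TLS, one option (an assumption about how this could be wired in, not what {{SSLFactory}} does today) is to connect a plain socket this way and then layer TLS on top with the JDK's {{SSLSocketFactory.createSocket(Socket, String, int, boolean)}}. Note that {{SO_TIMEOUT}} does not bound writes, so a hang in {{flush}} as in step 4 would still need the OS-level {{tcp_retries2}} workaround or some form of watchdog.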
I think the probability is especially high when a node is restarting and is overwhelmed with SSL handshakes and such. I don't fully understand the attached traceback as it appears we are getting a {{Connection Timeout}} from a {{send}} failure (my understanding is you can only get a connection timeout prior to a send), but I think it's reasonable that we have a timeout configuration issue. I'd like to try to make Cassandra robust to networking issues like this via maybe: # Change the {{SSLSocket}} {{getSocket}} methods to provide connection timeouts of 2s (equivalent to trunk's [timeout|https://github.com/apache/cassandra/blob/11496039fb18bb45407246602e31740c56d28157/src/java/org/apache/cassandra/net/async/NettyFactory.java#L329]) # Appropriately set recv timeouts via {{SO_TIMEOUT}}, maybe something like 2 minutes (in old versions via [setSoTimeout|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-], in trunk via
[jira] [Updated] (CASSANDRA-14358) OutboundTcpConnection can hang for many minutes when nodes restart
[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Lynch updated CASSANDRA-14358: - Description: edit: There is a reasonable workaround on Linux. I'm working on a proper patch for trunk but unfortunately it relies on the Netty refactor there so it will be hard to backport to previous versions. The workaround for earlier versions is to set: {code: I've been trying to debug nodes not being able to see each other during longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x which can contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 2.1.x clusters for us. I think I finally have a lead. It appears that prior to trunk (with the awesome Netty refactor) we do not set socket connect timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x) nor do we set {{SO_TIMEOUT}} as far as I can tell on outbound connections either. I believe that this means that we could potentially block forever on {{connect}} or {{recv}} syscalls, and we could block forever on the SSL Handshake as well. I think that the OS will protect us somewhat (and that may be what's causing the eventual timeout) but I think that given the right network conditions our {{OutboundTcpConnection}} threads can just be stuck never making any progress until the OS intervenes. I have attached some logs of such a network partition during a rolling restart where an old node in the cluster has a completely foobarred {{OutboundTcpConnection}} for ~10 minutes before finally getting a {{java.net.SocketException: Connection timed out (Write failed)}} and immediately successfully reconnecting. I conclude that the old node is the problem because the new node (the one that restarted) is sending ECHOs to the old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new node's ECHOs, but the new node is never getting the ECHOs. 
This appears, to me, to indicate that the old node's {{OutboundTcpConnection}} thread is just stuck and can't make any forward progress. By the time we could notice this and slap TRACE logging on, the only thing we see is ~10 minutes later a {{SocketException}} inside {{writeConnected}}'s flush and an immediate recovery. It is interesting to me that the exception happens in {{writeConnected}} and it's a _connection timeout_ (and since we see {{Write failure}} I believe that this can't be a connection reset), because my understanding is that we should have a fully handshaked SSL connection at that point in the code. Current theory: # "New" node restarts, "Old" node calls [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433] # Old node starts [creating a new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141] SSL socket # SSLSocket calls [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98], which conveniently calls connect with a default timeout of "forever". We could hang here forever until the OS kills us. # If we continue, we get to [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L263] which eventually calls [flush|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L341] on the output stream and also can hang forever. I think the probability is especially high when a node is restarting and is overwhelmed with SSL handshakes and such. 
I don't fully understand the attached traceback as it appears we are getting a {{Connection Timeout}} from a {{send}} failure (my understanding is you can only get a connection timeout prior to a send), but I think it's reasonable that we have a timeout configuration issue. I'd like to try to make Cassandra robust to networking issues like this via maybe: # Change the {{SSLSocket}} {{getSocket}} methods to provide connection timeouts of 2s (equivalent to trunk's [timeout|https://github.com/apache/cassandra/blob/11496039fb18bb45407246602e31740c56d28157/src/java/org/apache/cassandra/net/async/NettyFactory.java#L329]) # Appropriately set recv timeouts via {{SO_TIMEOUT}}, maybe something like 2 minutes (in old versions via [setSoTimeout|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-], in trunk via [SO_TIMEOUT|http://netty.io/4.0/api/io/netty/channel/ChannelOption.html#SO_TIMEOUT] # Since we can't set send timeouts afaik (thanks java) maybe we can have some kind of watchdog that ensures OutboundTcpConnection is making progress in its queue and if it doesn't make any
[jira] [Commented] (CASSANDRA-14457) Add a virtual table with current compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-14457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482546#comment-16482546 ] Aleksey Yeschenko commented on CASSANDRA-14457: --- As for 3/5, I’m thinking (keyspace, table, id) - so that you can do a SELECT by keyspace, without the table or ALLOW FILTERING. They’d still be equally close together in cqlsh. > Add a virtual table with current compactions > > > Key: CASSANDRA-14457 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14457 > Project: Cassandra > Issue Type: New Feature >Reporter: Chris Lohfink >Assignee: Chris Lohfink >Priority: Minor > Fix For: 4.x > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14457) Add a virtual table with current compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-14457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482541#comment-16482541 ] Aleksey Yeschenko commented on CASSANDRA-14457: --- bq. Can we just use "undefined" for summary redistribution with changing it to be part of key? We could use a sentinel like that, so long as it's something that isn't a legal keyspace/table name. Think 'all keyspaces' and 'all tables', with a space in-between. But I'm not sure we should even list it there, or that it should have ever been a compaction type in the first place.
[jira] [Commented] (CASSANDRA-14457) Add a virtual table with current compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-14457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482532#comment-16482532 ] Chris Lohfink commented on CASSANDRA-14457: --- 3/5: ((keyspace, table), id) would solve the issue I concatenated the keyspace/table together for (columns are listed alphabetically in cqlsh, so having them on opposite sides of the row was hard to read). So I will definitely go with that. Can we just use "undefined" for summary redistribution with changing it to be part of key?
[jira] [Commented] (CASSANDRA-13981) Enable Cassandra for Persistent Memory
[ https://issues.apache.org/jira/browse/CASSANDRA-13981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482490#comment-16482490 ] Tony Ruiz commented on CASSANDRA-13981: --- I am out of the office returning Monday May 18. For urgent matters please contact my manager: eric.kaczma...@intel.com Thanks, Tony > Enable Cassandra for Persistent Memory > --- > > Key: CASSANDRA-13981 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13981 > Project: Cassandra > Issue Type: New Feature > Components: Core >Reporter: Preetika Tyagi >Assignee: Preetika Tyagi >Priority: Major > Fix For: 4.0 > > Attachments: in-mem-cassandra-1.0.patch, in-mem-cassandra-2.0.patch, > readme.txt, readme2_0.txt > > > Currently, Cassandra relies on disks for data storage and hence it needs data > serialization, compaction, bloom filters and partition summary/index for > speedy access of the data. However, with persistent memory, data can be > stored directly in the form of Java objects and collections, which can > greatly simplify the retrieval mechanism of the data. What we are proposing > is to make use of faster and scalable B+ tree-based data collections built > for persistent memory in Java (PCJ: https://github.com/pmem/pcj) and enable a > complete in-memory version of Cassandra, while still keeping the data > persistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13981) Enable Cassandra for Persistent Memory
[ https://issues.apache.org/jira/browse/CASSANDRA-13981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482488#comment-16482488 ] Jason Brown commented on CASSANDRA-13981: - Thanks, [~pree] and [~shylaja.koko...@intel.com], for the patches. I've been reading them, understanding the scope of the technology, and see the direction you are going. However, I'd like to propose a slightly different direction. Stepping back, the pcj library is divided into two parts: the higher-level pcj components (as used in the version of this patch as previously posted), and the lower-level API, called LLPL in the library. LLPL is much smaller than the pcj parts, and offers a direct and simple way to just write bytes into a backing array from the persistent memory. In my opinion this will be far more natural for the cassandra community and developers, and provides more direct access to the storage bytes. We already have lots of serialization code, and we understand that quite well; thus I'd like to keep leveraging that lower-level thinking. We will need to write custom, non-generic data structures (like we already have for our LSM-based engine), but I only see this as a complete win. We need to optimize, in every way we reasonably can, our data structures as we are a database, after all. LLPL has some rough edges wrt code optimization and we will want to modify the transaction model a bit, but I suspect the pcj authors will work with us toward that end. With this as background, I've started sketching out a direction I think we should pursue. This sketch primarily shows the direction for thinking about serialization and memory allocation using LLPL. DISCLAIMER: this code doesn't compile, is not syntactically correct, and is wholly incomplete. It should be thought of as a loose blueprint (sketch!) for discussion. The sketch comprises the following concepts: - thread per sub-range (to reduce lock contention in the data structures). 
This is kinda inspired by the thread-per-core notion, but on a smaller scale. ({{TreeManager}} in this patch is a rudimentary dispatch class.) - how partitions should be stored - allocate a {{MemoryRegion}} from the LLPL allocator, wrap it with a {{DataOutputPlus}}, and write as we normally would. - rough implementations of the data structures for the primary index and storing rows. A longer treatment of this topic will be in the design doc (see below), but using a tree for the primary index (for partition look up) and then a map for the cql rows is the basic idea. I mostly want to show the ideas around serialization so I didn't actually implement the index nor the map - except for the leaf/entry nodes which show how the serialization/data layout fits into the data structure. - explicitly pass the transaction around on writes (instead of looking for it in a {{ThreadLocal}}, as the pcj transactions do). ||13981-sketch-1|| |[branch|https://github.com/jasobrown/cassandra/tree/13981-sketch-1]| I am proposing this sketch as a starting point for discussion, along with a forthcoming design doc to help us work out more high-level details of how cassandra as a main memory database should look. I'm working on the design doc now. It will explore how we can have a pluggable storage engine implementation that allows cassandra to run as a main memory database using persistent memory, while supporting the existing behaviors of cassandra in that kind of system. 
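The "allocate a region, wrap it in a data output, and write as we normally would" idea from the comment above can be sketched with JDK types only. Everything here is a stand-in and an assumption: a direct {{ByteBuffer}} plays the role of an LLPL {{MemoryRegion}}, {{DataOutputStream}} plays the role of {{DataOutputPlus}}, and {{RegionWriteSketch}}, {{RegionOutputStream}} and the {{[keyLen][key][valueLen][value]}} layout are hypothetical. It does not use the LLPL API and is not taken from the linked branch.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins only: ByteBuffer ~ memory region, DataOutputStream ~ DataOutputPlus.
class RegionWriteSketch {
    /** Adapts a fixed-size ByteBuffer (standing in for a persistent-memory region) to an OutputStream. */
    static final class RegionOutputStream extends OutputStream {
        private final ByteBuffer region;
        RegionOutputStream(ByteBuffer region) { this.region = region; }
        @Override public void write(int b) { region.put((byte) b); }
        @Override public void write(byte[] src, int off, int len) { region.put(src, off, len); }
    }

    /** Serializes one entry as [keyLen][key][valueLen][value] into a freshly sized region. */
    static ByteBuffer writeEntry(String key, byte[] value) throws IOException {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // Size the region up front, as one would when asking an allocator for exactly enough bytes.
        ByteBuffer region = ByteBuffer.allocateDirect(4 + keyBytes.length + 4 + value.length);
        try (DataOutputStream out = new DataOutputStream(new RegionOutputStream(region))) {
            out.writeInt(keyBytes.length);
            out.write(keyBytes);
            out.writeInt(value.length);
            out.write(value);
        }
        region.flip();                              // switch from writing to reading
        return region;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer region = writeEntry("pk1", new byte[] {1, 2, 3});
        System.out.println("bytes written: " + region.remaining());
    }
}
```

The point of the adapter is that existing serializers written against a data-output interface would not need to know whether the bytes end up in a heap array, a file, or a persistent-memory region.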
[jira] [Commented] (CASSANDRA-14223) Provide ability to do custom certificate validations (e.g. hostname validation, certificate revocation checks)
[ https://issues.apache.org/jira/browse/CASSANDRA-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482467#comment-16482467 ] Per Otterström commented on CASSANDRA-14223: [~jasobrown], I'm trying to understand your concern with blocking I/O. The only scenario I can think of is that several clients connect simultaneously and thereby allocate (and block) so many threads that already active connections don't get enough threads to execute requests? Not sure if that's the issue though. Can you elaborate a bit? In your patch there is a comment on {{SSLSessionValidator.validate()}} that got me confused. "This function should not block!". I thought the point of having this separation was to allow the validator to block? If I would like to implement hostname validation using a custom {{SSLSessionValidator}}, I think we need to change the signature of the {{validate()}} method to {{boolean validate(SocketChannel)}}. This change would obviously cascade to other places as well. I don't think it is possible to pull out remote peer IP/port from a {{Channel}} object. Also, I would need to find some way to get information from the certificate to compare. Is there some clever way to do that? bq. Perhaps another solution, a sort of middle ground, is to still make use of a custom TrustManager, but hand that to the netty SslContext, and then execute the TLS handshake in the netty pipeline on a different event loop group from the rest of the pipeline. That seems like a more attractive approach IMO. > Provide ability to do custom certificate validations (e.g. 
hostname > validation, certificate revocation checks) > -- > > Key: CASSANDRA-14223 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14223 > Project: Cassandra > Issue Type: Improvement > Components: Configuration >Reporter: Ron Blechman >Priority: Major > Labels: security > Fix For: 4.x > > Attachments: dsp.tar.gz > > > Cassandra server should be able to do additional certificate validations, > such as hostname validation and certificate revocation checking against > CRLs and/or using OCSP. > One approach could be to have SSLFactory use SSLContext.getDefault() instead > of forcing the creation of a new SSLContext using SSLContext.getInstance(). > Using the default SSLContext would allow a user to plug in their own custom > SSLSocketFactory via the java.security properties file. The custom > SSLSocketFactory could create a default SSLContext that was customized to do > any extra validation such as certificate revocation, host name validation, > etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14223) Provide ability to do custom certificate validations (e.g. hostname validation, certificate revocation checks)
[ https://issues.apache.org/jira/browse/CASSANDRA-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482416#comment-16482416 ] Per Otterström commented on CASSANDRA-14223: Attached dsp.tar.gz. A minimal security provider, only containing a single service - a TrustManager with enforced hostname validation. There is a readme with some instructions on how to use it. [~ronblechman], based on what you described around your tests, I believe that you should be able to install your own TrustManager in a similar way. Bouncy Castle seems to support a similar setup: [http://www.bouncycastle.org/wiki/display/JA1/Provider+Installation] What I like about this approach is that I can install and manage my security providers in the same way for all my Java based applications.
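For comparison with the provider-based TrustManager described above, the JDK also has a built-in switch for hostname validation. The sketch below uses that different mechanism, {{SSLParameters.setEndpointIdentificationAlgorithm}}, rather than the attached provider; {{HostnameCheckSketch}}, {{newClientEngine}}, the host name and the port are hypothetical, and how this would be wired into Cassandra's {{SSLFactory}} or the netty pipeline is left open.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

// Hypothetical helper, not Cassandra code: turns on the JDK's built-in endpoint identification.
class HostnameCheckSketch {
    /** Creates a client-mode engine that asks the JDK to check the peer certificate against peerHost. */
    static SSLEngine newClientEngine(SSLContext context, String peerHost, int peerPort) {
        SSLEngine engine = context.createSSLEngine(peerHost, peerPort); // peerHost is what the certificate is matched against
        engine.setUseClientMode(true);
        SSLParameters params = engine.getSSLParameters();
        params.setEndpointIdentificationAlgorithm("HTTPS");             // enables hostname checks during the handshake
        engine.setSSLParameters(params);
        return engine;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Placeholder host and port, for illustration only.
        SSLEngine engine = newClientEngine(SSLContext.getDefault(), "node1.example.com", 7001);
        System.out.println(engine.getSSLParameters().getEndpointIdentificationAlgorithm());
    }
}
```

This covers hostname validation only. Revocation checks (CRL/OCSP), which the ticket also asks for, would still need a custom TrustManager or JVM-level revocation settings.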
[jira] [Updated] (CASSANDRA-14458) Add virtual table to list active connections
[ https://issues.apache.org/jira/browse/CASSANDRA-14458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Yeschenko updated CASSANDRA-14458: -- Reviewer: Aleksey Yeschenko Fix Version/s: 4.x > Add virtual table to list active connections > > > Key: CASSANDRA-14458 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14458 > Project: Cassandra > Issue Type: New Feature >Reporter: Chris Lohfink >Assignee: Chris Lohfink >Priority: Minor > Fix For: 4.x > > > List all active connections in virtual table like:
> {code:sql}
> cqlsh:system> select * from system_views.clients ;
>
>  client_address   | cipher    | driver_name | driver_version | keyspace | protocol  | requests | ssl   | user      | version
> ------------------+-----------+-------------+----------------+----------+-----------+----------+-------+-----------+---------
>  /127.0.0.1:63903 | undefined | undefined   | undefined      |          | undefined |       13 | False | anonymous |       4
>  /127.0.0.1:63904 | undefined | undefined   | undefined      | system   | undefined |       16 | False | anonymous |       4
> {code}
[jira] [Commented] (CASSANDRA-14457) Add a virtual table with current compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-14457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482413#comment-16482413 ] Aleksey Yeschenko commented on CASSANDRA-14457: --- This looks good, so I only have some bikeshedding to contribute: 1. The table doesn't really represent compaction statistics, so should probably not name it compaction_stats? I know that the nodetool cmd is named compactionstats, but that was a bad name too, imo. So perhaps {{compaction_state}} and {{CompactionStateTable}}, or {{compaction_status}} and {{CompactionStatusTable}}, or even {{active_compactions}} and {{ActiveCompactionsTable}}; whichever sounds nicer to your American ear. 2. Should we perhaps use {{int}} as the type for current and total columns, instead of {{text}}? 3. Maybe don't concat the names of the keyspace and the table? Why not have them in separate columns, for easier querying? 4. We tend to lowercase enums in tables, usually. Can you slap a {{toLowerCase()}} on {{task_type}} please? 5. I would prefer to have ((keyspace, table), id) or ((keyspace), table, id) as PRIMARY KEY here, personally. Upon a quick look, it seems like the only case where we don't have a keyspace/table attached to a compaction is summary redistribution, which is performed on all sstables. But it's not really a compaction, so perhaps we should exclude it from the dataset?
[jira] [Updated] (CASSANDRA-14223) Provide ability to do custom certificate validations (e.g. hostname validation, certificate revocation checks)
[ https://issues.apache.org/jira/browse/CASSANDRA-14223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Per Otterström updated CASSANDRA-14223:
---------------------------------------
    Attachment: dsp.tar.gz

> Provide ability to do custom certificate validations (e.g. hostname validation, certificate revocation checks)
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14223
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14223
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Configuration
>            Reporter: Ron Blechman
>            Priority: Major
>              Labels: security
>             Fix For: 4.x
>
>         Attachments: dsp.tar.gz
>
>
> Cassandra server should be able to do additional certificate validations,
> such as hostname validation and certificate revocation checking against
> CRLs and/or using OCSP.
> One approach could be to have SSLFactory use SSLContext.getDefault() instead
> of forcing the creation of a new SSLContext using SSLContext.getInstance().
> Using the default SSLContext would allow a user to plug in their own custom
> SSLSocketFactory via the java.security properties file. The custom
> SSLSocketFactory could create a default SSLContext that was customized to do
> any extra validation such as certificate revocation, host name validation,
> etc.
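As a rough illustration of the approach described in the issue above, the following standalone sketch obtains the JVM-wide default context with SSLContext.getDefault() and switches on hostname validation for one engine using standard JSSE calls. It is not Cassandra's SSLFactory code and not the content of the attachment; the class name, method name, and the host and port values are hypothetical, and revocation checking (CRL/OCSP) is deliberately left out.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

// Hypothetical sketch; not Cassandra's SSLFactory.
class DefaultContextSketch {

    // Builds a client-mode engine from the default SSLContext and asks JSSE to
    // verify the peer certificate against peerHost during the handshake.
    static SSLEngine engineWithHostnameValidation(String peerHost, int peerPort) {
        SSLContext context;
        try {
            // Reuses whatever default context the JVM (or SSLContext.setDefault)
            // provides, instead of creating a fresh one with SSLContext.getInstance().
            context = SSLContext.getDefault();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
        SSLEngine engine = context.createSSLEngine(peerHost, peerPort);
        engine.setUseClientMode(true);

        SSLParameters params = engine.getSSLParameters();
        // "HTTPS" selects the standard endpoint identification (hostname) check.
        params.setEndpointIdentificationAlgorithm("HTTPS");
        engine.setSSLParameters(params);
        return engine;
    }

    public static void main(String[] args) {
        // Host and port are placeholders; no connection is opened here.
        SSLEngine engine = engineWithHostnameValidation("node1.example.com", 7001);
        System.out.println(engine.getSSLParameters().getEndpointIdentificationAlgorithm());
    }
}
```

Because the context comes from SSLContext.getDefault(), a deployment that installs its own default context would have its trust decisions picked up here without code changes, which is the property the issue is asking for.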
[jira] [Updated] (CASSANDRA-14457) Add a virtual table with current compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-14457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Yeschenko updated CASSANDRA-14457:
------------------------------------------
         Reviewer: Aleksey Yeschenko
    Fix Version/s: 4.x

> Add a virtual table with current compactions
> --------------------------------------------
>
>                 Key: CASSANDRA-14457
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14457
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Chris Lohfink
>            Assignee: Chris Lohfink
>            Priority: Minor
>             Fix For: 4.x
>
>