[jira] [Commented] (CASSANDRA-11905) Index building fails to start CFS.readOrdering when reading
[ https://issues.apache.org/jira/browse/CASSANDRA-11905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306252#comment-15306252 ] Stefania commented on CASSANDRA-11905: -- There is a typo at line 122 of {{SinglePartitionSliceCommandTest}}: it should be {{pIter.next()}} rather than {{partitionIterator.next()}}. This is causing 2 CI failures; the remaining failing tests are all known, unrelated failures. +1 once the typo in the test is fixed. > Index building fails to start CFS.readOrdering when reading > --- > > Key: CASSANDRA-11905 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11905 > Project: Cassandra > Issue Type: Bug > Reporter: Sylvain Lebresne > Assignee: Sylvain Lebresne > Priority: Critical > Fix For: 3.0.x, 3.x > > > The code for indexing a partition when building an index in 3.0 is:
> {noformat}
> SinglePartitionReadCommand cmd = SinglePartitionReadCommand.fullPartitionRead(cfs.metadata, FBUtilities.nowInSeconds(), key);
> try (OpOrder.Group opGroup = cfs.keyspace.writeOrder.start();
>      UnfilteredRowIterator partition = cmd.queryMemtableAndDisk(cfs, opGroup))
> {
>     cfs.indexManager.indexPartition(partition, opGroup, indexes, cmd.nowInSec());
> }
> {noformat}
> which is clearly incorrect, as the {{OpOrder}} that {{queryMemtableAndDisk}} expects is the one from {{cfs.readOrdering}}, not the one for writes on the keyspace. > This wasn't a problem prior to 3.0, as the similar code used the pager, which properly took the read {{OpOrder}} internally, but I messed this up in CASSANDRA-8099. > Thanks to [~Stefania] for pointing that out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
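A sketch of what the corrected call site could look like: {{queryMemtableAndDisk}} gets a group started on {{cfs.readOrdering}}, while {{indexPartition}} keeps its write-order group. This is illustrative only; the exact 3.0 API for obtaining the read-order group may differ from what is shown.

```
SinglePartitionReadCommand cmd = SinglePartitionReadCommand.fullPartitionRead(cfs.metadata, FBUtilities.nowInSeconds(), key);
try (OpOrder.Group writeGroup = cfs.keyspace.writeOrder.start();
     OpOrder.Group readGroup = cfs.readOrdering.start();   // the read barrier that queryMemtableAndDisk actually expects
     UnfilteredRowIterator partition = cmd.queryMemtableAndDisk(cfs, readGroup))
{
    cfs.indexManager.indexPartition(partition, writeGroup, indexes, cmd.nowInSec());
}
```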
[jira] [Updated] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation
[ https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-11738: --- Assignee: (was: Jonathan Ellis) > Re-think the use of Severity in the DynamicEndpointSnitch calculation > - > > Key: CASSANDRA-11738 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11738 > Project: Cassandra > Issue Type: Improvement > Reporter: Jeremiah Jordan > Fix For: 3.x > > > CASSANDRA-11737 was opened to allow completely disabling the use of severity > in the DynamicEndpointSnitch calculation, but that is a pretty big hammer. > There is probably something we can do to make better use of the score. > The issue seems to be that severity is given equal weight with latency in the > current code, and also that severity is only based on disk IO. If you have a > node that is CPU bound on something (say catching up on LCS compactions > because of bootstrap/repair/replace) the IO wait can be low, but the latency > to the node is high. > Some ideas I had are: > 1. Allowing a yaml parameter to tune how much impact the severity score has > in the calculation. > 2. Taking CPU load into account as well as IO wait (this would probably help > in the cases I have seen things go sideways) > 3. Move the -D from CASSANDRA-11737 to being a yaml level setting > 4. Go back to just relying on latency and get rid of severity altogether. > Now that we have rapid read protection, maybe just using latency is enough, > as it can help where the predictive nature of IO wait would have been useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11738) Re-think the use of Severity in the DynamicEndpointSnitch calculation
[ https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306185#comment-15306185 ] Jonathan Ellis commented on CASSANDRA-11738: bq. a measured latency can be influenced by a badly timed GC (e.g. G1 running with a 500ms goal that sometimes has "valid" STW phases of up to 300/400ms). True enough, but that's actually okay for our use case here. We prefer to use *actual* latency, so we only need the *estimate* when there is no actual available, i.e., when other coordinators stopped routing requests to us because the actual was high. The job of the estimate is to let the other coordinators know (when it gets low again) that they can resume sending us requests. bq. Compactions and GCs can kick in every time anyway. Right, but I see these as two different categories. GC STW lasts for fractions of a second, while compaction can last minutes or even hours for a large STCS job. So trying to route around GC is futile, but routing around compaction is not. bq. Just as an idea: a node can request a ping-response from a node it sends a request to If possible, I'd prefer to make this follow the existing "push" paradigm, via gossip, for simplicity. I had two ideas along those lines: # Give up on computing a latency number in favor of other "load" metrics. The coordinator can then extrapolate latency by comparing that number to other nodes with similar load. # Just brute force it: run SELECT * LIMIT 1 every 10s and report the latency averaged across a sample of user tables. > Re-think the use of Severity in the DynamicEndpointSnitch calculation > - > > Key: CASSANDRA-11738 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11738 > Project: Cassandra > Issue Type: Improvement >Reporter: Jeremiah Jordan >Assignee: Jonathan Ellis > Fix For: 3.x > > > CASSANDRA-11737 was opened to allow completely disabling the use of severity > in the DynamicEndpointSnitch calculation, but that is a pretty big hammer. 
> There is probably something we can do to make better use of the score. > The issue seems to be that severity is given equal weight with latency in the > current code, and also that severity is only based on disk IO. If you have a > node that is CPU bound on something (say catching up on LCS compactions > because of bootstrap/repair/replace) the IO wait can be low, but the latency > to the node is high. > Some ideas I had are: > 1. Allowing a yaml parameter to tune how much impact the severity score has > in the calculation. > 2. Taking CPU load into account as well as IO wait (this would probably help > in the cases I have seen things go sideways) > 3. Move the -D from CASSANDRA-11737 to being a yaml level setting > 4. Go back to just relying on latency and get rid of severity altogether. > Now that we have rapid read protection, maybe just using latency is enough, > as it can help where the predictive nature of IO wait would have been useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
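Idea 1 in the quoted list amounts to making severity's contribution to the snitch score tunable. A minimal sketch, assuming a hypothetical yaml option (say {{dynamic_snitch_severity_weight}}); the names and the score formula are illustrative, not the actual {{DynamicEndpointSnitch}} code:

```java
// Sketch: a tunable severity weight for the dynamic snitch score.
// A lower score means a more attractive replica.
public class SeverityWeightedScore
{
    // weight = 1.0 reproduces today's equal weighting of latency and severity;
    // weight = 0.0 is equivalent to disabling severity (CASSANDRA-11737).
    public static double score(double meanLatencyMillis, double severity, double severityWeight)
    {
        return meanLatencyMillis + severityWeight * severity;
    }
}
```

A weight between 0 and 1 would let operators damp the IO-wait-only severity signal without discarding it entirely.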
[jira] [Commented] (CASSANDRA-11032) Full trace returned on ReadFailure by cqlsh
[ https://issues.apache.org/jira/browse/CASSANDRA-11032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306135#comment-15306135 ] Zhongxiang Zheng commented on CASSANDRA-11032: -- Thank you for your review! > Full trace returned on ReadFailure by cqlsh > --- > > Key: CASSANDRA-11032 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11032 > Project: Cassandra > Issue Type: Improvement > Components: Tools >Reporter: Chris Splinter >Assignee: Zhongxiang Zheng >Priority: Minor > Labels: cqlsh, lhf > Fix For: 3.7, 3.0.7 > > Attachments: CASSANDRA-11032-trunk.patch > > > I noticed that the full traceback is returned on a read failure where I > expected this to be a one line exception with the ReadFailure message. It is > minor, but would it be better to only return the ReadFailure details? > {code} > cqlsh> SELECT * FROM test_encryption_ks.test_bad_table; > Traceback (most recent call last): > File "/usr/local/lib/dse/bin/../resources/cassandra/bin/cqlsh.py", line > 1276, in perform_simple_statement > result = future.result() > File > "/usr/local/lib/dse/resources/cassandra/bin/../lib/cassandra-driver-internal-only-3.0.0-6af642d.zip/cassandra-driver-3.0.0-6af642d/cassandra/cluster.py", > line 3122, in result > raise self._final_exception > ReadFailure: code=1300 [Replica(s) failed to execute read] message="Operation > failed - received 0 responses and 1 failures" info={'failures': 1, > 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11919) Failure in nodetool decommission
[ https://issues.apache.org/jira/browse/CASSANDRA-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306070#comment-15306070 ] vin01 commented on CASSANDRA-11919: --- In "nodetool netstats", when the decommission task nears the end, I get: Sending 0 files, 32652963850 bytes total. Already sent 0 files, 0 bytes total It keeps repeating without any change. Will "nodetool removenode" help here? > Failure in nodetool decommission > > > Key: CASSANDRA-11919 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11919 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging > Environment: Centos 6.6 x86_64, Cassandra 2.2.4 > Reporter: vin01 > Priority: Minor > > I keep getting an exception while attempting "nodetool decommission". > ERROR [STREAM-IN-/[NODE_ON_WHICH_DECOMMISSION_RUNNING]] 2016-05-29 > 13:08:39,040 StreamSession.java:524 - [Stream > #b2039080-25c2-11e6-bd92-d71331aaf180] Streaming error occurred > java.lang.IllegalArgumentException: Unknown type 0 > at > org.apache.cassandra.streaming.messages.StreamMessage$Type.get(StreamMessage.java:96) > ~[apache-cassandra-2.2.4.jar:2.2.4] > at > org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:57) > ~[apache-cassandra-2.2.4.jar:2.2.4] > at > org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:261) > ~[apache-cassandra-2.2.4.jar:2.2.4] > Because of these errors, the decommission process is not succeeding. > Is interrupting the decommission process safe? Seems like I will have to > retry to make it work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11838) dtest failure in largecolumn_test:TestLargeColumn.cleanup_test
[ https://issues.apache.org/jira/browse/CASSANDRA-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306065#comment-15306065 ] Alex Petrov commented on CASSANDRA-11838: - I've run a git bisect, and it confirms the commit [1e92ce|https://github.com/apache/cassandra/commit/1e92ce43a5a730f81d3f6cfd72e7f4b126db788a], the same one mentioned [here|http://cassci.datastax.com/job/trunk_offheap_dtest/185/]. The test also fails without {{offheap_memtables}}. The patch cleans up the references to values within the reused {{BTree}} and "trims" the large values within {{DataOutputBuffer}}. Since the buffer only grows, a value that grows large enough will remain there even if it is read just once, so we need to clean it up. Unfortunately, we can't simply skip recycling such objects, as there's no way to pass constructor options (which may be a good thing, since {{Recycler}} assumes all instances are the same and can pick whichever is available), so we have to force-trim (in this case, re-allocate the buffer down to some threshold value). 
> dtest failure in largecolumn_test:TestLargeColumn.cleanup_test > -- > > Key: CASSANDRA-11838 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11838 > Project: Cassandra > Issue Type: Bug >Reporter: Philip Thompson >Assignee: Alex Petrov > Labels: dtest > Fix For: 3.x > > Attachments: node1.log, node1_debug.log, node2.log, node2_debug.log > > > Example failure at: > http://cassci.datastax.com/job/trunk_offheap_dtest/200/testReport/largecolumn_test/TestLargeColumn/cleanup_test/ > node 1 contains the following OOM in its log: > {code} > ERROR [SharedPool-Worker-1] 2016-05-16 22:54:10,112 Message.java:611 - > Unexpected exception during request; channel = [id: 0xb97f2640, > L:/127.0.0.1:9042 - R:/127.0.0.1:48190] > java.lang.OutOfMemoryError: Java heap space > at org.apache.cassandra.transport.CBUtil.readRawBytes(CBUtil.java:533) > ~[main/:na] > at > org.apache.cassandra.transport.CBUtil.readBoundValue(CBUtil.java:407) > ~[main/:na] > at org.apache.cassandra.transport.CBUtil.readValueList(CBUtil.java:462) > ~[main/:na] > at > org.apache.cassandra.cql3.QueryOptions$Codec.decode(QueryOptions.java:417) > ~[main/:na] > at > org.apache.cassandra.cql3.QueryOptions$Codec.decode(QueryOptions.java:365) > ~[main/:na] > at > org.apache.cassandra.transport.messages.ExecuteMessage$1.decode(ExecuteMessage.java:45) > ~[main/:na] > at > org.apache.cassandra.transport.messages.ExecuteMessage$1.decode(ExecuteMessage.java:41) > ~[main/:na] > at > org.apache.cassandra.transport.Message$ProtocolDecoder.decode(Message.java:280) > ~[main/:na] > at > org.apache.cassandra.transport.Message$ProtocolDecoder.decode(Message.java:261) > ~[main/:na] > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:277) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:264) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:292) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:278) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:962) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:879) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at >
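The trim-on-recycle idea described in the comment above can be sketched as follows. This is a minimal, self-contained illustration of the approach, not the actual {{DataOutputBuffer}} code; the class name, threshold, and methods are hypothetical:

```java
import java.nio.ByteBuffer;

// Sketch: a reusable buffer that may grow while serializing one large value,
// and is shrunk back to a threshold before being recycled, so a single huge
// allocation is not pinned in the pool forever.
public class TrimmingBuffer
{
    static final int TRIM_THRESHOLD = 1 << 20; // 1 MiB, illustrative

    private ByteBuffer buffer = ByteBuffer.allocate(128);

    public void ensureCapacity(int size)
    {
        if (buffer.capacity() < size)
            buffer = ByteBuffer.allocate(size); // grow to fit a large value
    }

    /** Called just before the buffer is returned to the pool for reuse. */
    public void trimForReuse()
    {
        if (buffer.capacity() > TRIM_THRESHOLD)
            buffer = ByteBuffer.allocate(TRIM_THRESHOLD); // drop the oversized backing array
    }

    public int capacity()
    {
        return buffer.capacity();
    }
}
```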
[jira] [Comment Edited] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306030#comment-15306030 ] Robert Stupp edited comment on CASSANDRA-11818 at 5/29/16 7:10 PM: --- I've tried [~norman]'s patch for Netty 4.1 against trunk. With the patch enabled (requires {{-Dio.netty.noDirectBufferNoCleaner=false}}) my overloaded node recovers nicely. The CMS-GC-storm caused by {{Bits.reserveMemory()}} does not occur and the node remains responsive. However, while the node is in an overload situation, it spews a lot of errors. Unfortunately these are {{java.lang.OutOfMemoryError: No more memory available}}, which is generally fine, but in this case it just indicates that there is not enough direct memory to fulfill the *current* request. IMO, passing this OOM to {{JVMStabilityInspector}} would be wrong, since it is a recoverable error. (Background: Netty then has a separate, distinct direct memory pool, which does not affect other operations or memory pools.) I've also applied the same technique to C* internal direct memory allocations. (We already use {{FileUtils.clean()}} to clean up direct buffers.) To summarize, {{Bits.reserveMemory}} + {{Cleaner}} are the root cause IMO. Not having both reduces client latency as a side effect. EDIT: removed client-latency numbers. Was not from 6 to <1ms - but from .8ms to .1ms (99.9 percentile). EDIT2: the number of 6ms can actually be correct. Caused by a longer GC, the native-request-pool can grow (e.g. from 140 to 190) - after that, the client-latency suddenly increased from .1ms to 6ms. was (Author: snazy): I've tried [~norman]'s patch for Netty 4.1 against trunk. With the patch enabled (requires {{-Dio.netty.noDirectBufferNoCleaner=false}}) my overloaded node recovers nicely. The CMS-GC-storm caused by {{Bits.reserveMemory()}} does not occur and the node remains responsive. However, while the node is in an overload situation, it spews a lot of errors. 
Unfortunately these are {{java.lang.OutOfMemoryError: No more memory available}}, which is generally fine, but in this case it just indicates that there is not enough direct memory to fulfill the *current* request. IMO, passing this OOM to {{JVMStabilityInspector}} would be wrong, since it is a recoverable error. (Background: Netty then has a separate, distinct direct memory pool, which does not affect other operations or memory pools.) I've also applied the same technique to C* internal direct memory allocations. (We already use {{FileUtils.clean()}} to clean up direct buffers.) To summarize, {{Bits.reserveMemory}} + {{Cleaner}} are the root cause IMO. Not having both reduces client latency as a side effect. EDIT: removed client-latency numbers. Was not from 6 to <1ms - but from .8ms to .1ms (99.9 percentile). > C* does neither recover nor trigger stability inspector on direct memory OOM > > > Key: CASSANDRA-11818 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11818 > Project: Cassandra > Issue Type: Bug > Reporter: Robert Stupp > Attachments: 11818-direct-mem-unpooled.png, 11818-direct-mem.png, > oom-histo-live.txt, oom-stack.txt > > > The following stack trace is not caught by {{JVMStabilityInspector}}. > Situation was caused by a load test with a lot of parallel writes and reads > against a single node. 
> {code} > ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - > Unexpected exception during request; channel = [id: 0x1e02351b, > L:/127.0.0.1:9042 - R:/127.0.0.1:51087] > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92] > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > ~[na:1.8.0_92] > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > ~[na:1.8.0_92] > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at >
[jira] [Comment Edited] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306030#comment-15306030 ] Robert Stupp edited comment on CASSANDRA-11818 at 5/29/16 6:57 PM: --- I've tried [~norman]'s patch for Netty 4.1 against trunk. With the patch enabled (requires {{-Dio.netty.noDirectBufferNoCleaner=false}}) my overloaded node recovers nicely. The CMS-GC-storm caused by {{Bits.reserveMemory()}} does not occur and the node remains responsive. However, while the node is in an overload situation, it spews a lot of errors. Unfortunately these are {{java.lang.OutOfMemoryError: No more memory available}}, which is generally fine, but in this case it just indicates that there is not enough direct memory to fulfill the *current* request. IMO, passing this OOM to {{JVMStabilityInspector}} would be wrong, since it is a recoverable error. (Background: Netty then has a separate, distinct direct memory pool, which does not affect other operations or memory pools.) I've also applied the same technique to C* internal direct memory allocations. (We already use {{FileUtils.clean()}} to clean up direct buffers.) To summarize, {{Bits.reserveMemory}} + {{Cleaner}} are the root cause IMO. Not having both reduces client latency as a side effect. EDIT: removed client-latency numbers. Was not from 6 to <1ms - but from .8ms to .1ms (99.9 percentile). was (Author: snazy): I've tried [~norman]'s patch for Netty 4.1 against trunk. With the patch enabled (requires {{-Dio.netty.noDirectBufferNoCleaner=false}}) my overloaded node recovers nicely. The CMS-GC-storm caused by {{Bits.reserveMemory()}} does not occur and the node remains responsive. However, while the node is in an overload situation, it spews a lot of errors. Unfortunately these are {{java.lang.OutOfMemoryError: No more memory available}}, which is generally fine, but in this case it just indicates that there is not enough direct memory to fulfill the *current* request. 
IMO, passing this OOM to {{JVMStabilityInspector}} would be wrong, since it is a recoverable error. (Background: Netty then has a separate, distinct direct memory pool, which does not affect other operations or memory pools.) I've also applied the same technique to C* internal direct memory allocations. (We already use {{FileUtils.clean()}} to clean up direct buffers.) To summarize, {{Bits.reserveMemory}} + {{Cleaner}} are the root cause IMO. Not having both reduces client latency as a side effect (from 6ms to <1ms in my test). > C* does neither recover nor trigger stability inspector on direct memory OOM > > > Key: CASSANDRA-11818 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11818 > Project: Cassandra > Issue Type: Bug > Reporter: Robert Stupp > Attachments: 11818-direct-mem-unpooled.png, 11818-direct-mem.png, > oom-histo-live.txt, oom-stack.txt > > > The following stack trace is not caught by {{JVMStabilityInspector}}. > Situation was caused by a load test with a lot of parallel writes and reads > against a single node. 
> {code} > ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - > Unexpected exception during request; channel = [id: 0x1e02351b, > L:/127.0.0.1:9042 - R:/127.0.0.1:51087] > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92] > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > ~[na:1.8.0_92] > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > ~[na:1.8.0_92] > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349) > ~[main/:na] > at >
[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306030#comment-15306030 ] Robert Stupp commented on CASSANDRA-11818: -- I've tried [~norman]'s patch for Netty 4.1 against trunk. With the patch enabled (requires {{-Dio.netty.noDirectBufferNoCleaner=false}}) my overloaded node recovers nicely. The CMS-GC-storm caused by {{Bits.reserveMemory()}} does not occur and the node remains responsive. However, while the node is in an overload situation, it spews a lot of errors. Unfortunately these are {{java.lang.OutOfMemoryError: No more memory available}}, which is generally fine, but in this case it just indicates that there is not enough direct memory to fulfill the *current* request. IMO, passing this OOM to {{JVMStabilityInspector}} would be wrong, since it is a recoverable error. (Background: Netty then has a separate, distinct direct memory pool, which does not affect other operations or memory pools.) I've also applied the same technique to C* internal direct memory allocations. (We already use {{FileUtils.clean()}} to clean up direct buffers.) To summarize, {{Bits.reserveMemory}} + {{Cleaner}} are the root cause IMO. Not having both reduces client latency as a side effect (from 6ms to <1ms in my test). > C* does neither recover nor trigger stability inspector on direct memory OOM > > > Key: CASSANDRA-11818 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11818 > Project: Cassandra > Issue Type: Bug > Reporter: Robert Stupp > Attachments: 11818-direct-mem-unpooled.png, 11818-direct-mem.png, > oom-histo-live.txt, oom-stack.txt > > > The following stack trace is not caught by {{JVMStabilityInspector}}. > Situation was caused by a load test with a lot of parallel writes and reads > against a single node. 
> {code} > ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - > Unexpected exception during request; channel = [id: 0x1e02351b, > L:/127.0.0.1:9042 - R:/127.0.0.1:51087] > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92] > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > ~[na:1.8.0_92] > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > ~[na:1.8.0_92] > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349) > ~[main/:na] > at > org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314) > ~[main/:na] > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > 
io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445) > ~[main/:na] > at > io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at
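The distinction argued in the comment above (a pool-local direct-memory OOM is recoverable back-pressure, while heap exhaustion is fatal) could be expressed as a classification along these lines. A sketch only: the class and method are hypothetical and the message matching is illustrative, not the actual {{JVMStabilityInspector}} logic:

```java
// Sketch: classify OutOfMemoryErrors by whether they threaten the whole JVM.
public class OomPolicy
{
    public static boolean isFatal(Throwable t)
    {
        if (!(t instanceof OutOfMemoryError))
            return false; // only OOMs are candidates for killing the process
        String msg = t.getMessage();
        // "No more memory available" from a dedicated direct pool only means
        // this one request cannot be served; treat it as recoverable.
        // Anything else (e.g. "Java heap space") still brings the node down.
        return msg == null || !msg.contains("No more memory available");
    }
}
```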
[jira] [Created] (CASSANDRA-11919) Failure in nodetool decommission
vin01 created CASSANDRA-11919: - Summary: Failure in nodetool decommission Key: CASSANDRA-11919 URL: https://issues.apache.org/jira/browse/CASSANDRA-11919 Project: Cassandra Issue Type: Bug Components: Streaming and Messaging Environment: Centos 6.6 x86_64, Cassandra 2.2.4 Reporter: vin01 Priority: Minor I keep getting an exception while attempting "nodetool decommission". ERROR [STREAM-IN-/[NODE_ON_WHICH_DECOMMISSION_RUNNING]] 2016-05-29 13:08:39,040 StreamSession.java:524 - [Stream #b2039080-25c2-11e6-bd92-d71331aaf180] Streaming error occurred java.lang.IllegalArgumentException: Unknown type 0 at org.apache.cassandra.streaming.messages.StreamMessage$Type.get(StreamMessage.java:96) ~[apache-cassandra-2.2.4.jar:2.2.4] at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:57) ~[apache-cassandra-2.2.4.jar:2.2.4] at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:261) ~[apache-cassandra-2.2.4.jar:2.2.4] Because of these errors, the decommission process is not succeeding. Is interrupting the decommission process safe? Seems like I will have to retry to make it work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11875) Create sstableconvert tool with support to ma format
[ https://issues.apache.org/jira/browse/CASSANDRA-11875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306004#comment-15306004 ] Paulo Motta commented on CASSANDRA-11875: - Thanks for the update! See follow-up comments below:
* There is still some common code between the {{StandaloneConverter}} and {{StandaloneUpgrader}} main methods; most of it can be extracted into common methods used by both.
* The supported version check will probably be used in other places, so we should probably move it to {{BigVersion}}. There are already {{isCompatible}} and {{isCompatibleForStream}} methods, so we can maybe add an {{isCompatibleForWriting}}. It would be nice if you could add a unit test that checks that trying to convert to an unsupported version fails.
bq. Also added a new test suite, SSTableConversionTest, but it has some problems when performing conversion.
It seems this is due to re-loading the schema with {{Schema.instance.loadFromDisk(false)}} in {{StandaloneConverter}}. Since we want to focus on testing the conversion itself, we will probably have more flexibility testing the internal class {{SSTableConverter}}, so we don't have to add special testing options to {{StandaloneConverter}} and can also play around with SSTableReaders directly. Instead of basing our tests on {{SSTableRewriterTest}} as initially discussed, it's probably more convenient to base them on {{CQLTester}}, since we will be doing data conversions, and testing at a higher level with CQL is the only way to ensure converted data is being interpreted correctly. 
The tests should have more or less the following structure:
* Insert data with CQL
* Flush to disk
* Read and verify data in the current version with CQL
* Keep references to the original SSTableReaders and clean up the ColumnFamilyStore (clearUnsafe)
* Perform the conversion on the original SSTableReaders
* Verify metadata was converted correctly on the converted SSTableReaders
* Add the converted SSTableReaders to the ColumnFamilyStore (addSSTable)
* Read and verify data in the converted version with CQL
For the mb to ma conversion, since there is no data conversion involved (only metadata), you can use {{SimpleQueryTest.testTableWithoutClustering}} as an example to write the first test case. You may use {{CQLTester}} utility methods such as {{flush()}} and {{getCurrentColumnFamilyStore()}} to flush and access the {{ColumnFamilyStore}}. > Create sstableconvert tool with support to ma format > > > Key: CASSANDRA-11875 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11875 > Project: Cassandra > Issue Type: Sub-task > Components: Tools > Reporter: Paulo Motta > Assignee: Kaide Mu > Priority: Minor > Attachments: trunk-11875-WIP-V1.patch > > > Currently {{Upgrader}} receives an sstable in any readable format, and writes > into {{BigFormat.getLatestVersion()}}. We should generalize it by making it > receive a {{target}} version and probably also rename it to > {{SSTableConverter}}. > Based on this we can create a {{StandaloneDowngrader}} tool which will > perform a downgrade of specified sstables to a target version. To start with, > we should support only downgrading to the {{ma}} format (from the current format > {{mb}}); downgrading to any other version should be forbidden. Since we already > support serializing to "ma" we will not need to do any data conversion. > We should also create a test suite that creates an sstable with data in the > current format, performs the downgrade, and verifies data in the new format is > correct. This will be the base test suite for more advanced conversions in > the future. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
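The proposed test structure could look roughly like the skeleton below. This is only a sketch: {{createTable}}, {{execute}}, {{flush()}}, {{getCurrentColumnFamilyStore()}} and friends are the {{CQLTester}} helpers mentioned above, while {{convert(...)}} stands in for the {{SSTableConverter}} entry point that is still under review and may change:

```
public class SSTableConversionTest extends CQLTester
{
    @Test
    public void testConvertWithoutClustering() throws Throwable
    {
        createTable("CREATE TABLE %s (k text PRIMARY KEY, v int)");
        execute("INSERT INTO %s (k, v) VALUES ('key', 1)");
        flush();                                            // write sstables in the current (mb) format
        assertRows(execute("SELECT * FROM %s"), row("key", 1));

        ColumnFamilyStore cfs = getCurrentColumnFamilyStore();
        Set<SSTableReader> originals = cfs.getLiveSSTables();
        cfs.clearUnsafe();                                  // detach the original readers

        // hypothetical converter entry point; verify converted metadata here
        Collection<SSTableReader> converted = convert(originals, "ma");

        cfs.addSSTables(converted);                         // re-attach and re-read via CQL
        assertRows(execute("SELECT * FROM %s"), row("key", 1));
    }
}
```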