Multiget performance
Hi all, I’ve always been told that multigets are a Cassandra anti-pattern for performance reasons. I ran a quick test tonight to prove it to myself, and, sure enough, slowness ensued. It takes about 150ms to get 100 keys for my use case. Not terrible, but at least an order of magnitude from what I need it to be. So far, I’ve been able to denormalize and not have any problems. Today, I ran into a use case where denormalization introduces a huge amount of complexity to the code. It’s very tempting to cache a subset in Redis and call it a day — probably will. But, that’s not a very satisfying answer. It’s only about 5GB of data and it feels like I should be able to tune a Cassandra CF to be within 2x. The workload is around 70% reads. Most of the writes are updates to existing data. Currently, it’s in an LCS CF with ~30M rows. The cluster is 300GB total with 3-way replication, running across 12 fairly large boxes with 16G RAM. All on SSDs. Striped across 3 AZs in AWS (hi1.4xlarges, fwiw). Has anyone had success getting good results for this kind of workload? Or, is Cassandra just not suited for it at all and I should just use an in-memory store? -Allan
Re: binary protocol server side sockets
Hello Graham

You can use the following code with the official Java driver:

SocketOptions socketOptions = new SocketOptions();
socketOptions.setKeepAlive(true);

Cluster.builder().addContactPoints(contactPointsList)
    .withPort(cql3Port)
    .withCompression(ProtocolOptions.Compression.SNAPPY)
    .withCredentials(cassandraUsername, cassandraPassword)
    .withSocketOptions(socketOptions)
    .build();

or:

alreadyBuiltClusterInstance.getConfiguration().getSocketOptions().setKeepAlive(true);

Although I'm not sure the second alternative works, because the cluster is already built and the connection may already be established...

Regards Duy Hai DOAN

On Wed, Apr 9, 2014 at 12:59 AM, graham sanderson gra...@vast.com wrote: Is there a way to configure KEEPALIVE on the server end sockets of the binary protocol. rpc_keepalive only affects thrift. This is on 2.0.5 Thanks, Graham
Re: Multiget performance
Are you making the 100 calls in serial, or in parallel? Thanks, Daniel On Tue, Apr 8, 2014 at 11:22 PM, Allan C alla...@gmail.com wrote: Hi all, I've always been told that multigets are a Cassandra anti-pattern for performance reasons. I ran a quick test tonight to prove it to myself, and, sure enough, slowness ensued. It takes about 150ms to get 100 keys for my use case. Not terrible, but at least an order of magnitude from what I need it to be. So far, I've been able to denormalize and not have any problems. Today, I ran into a use case where denormalization introduces a huge amount of complexity to the code. It's very tempting to cache a subset in Redis and call it a day -- probably will. But, that's not a very satisfying answer. It's only about 5GB of data and it feels like I should be able to tune a Cassandra CF to be within 2x. The workload is around 70% reads. Most of the writes are updates to existing data. Currently, it's in an LCS CF with ~30M rows. The cluster is 300GB total with 3-way replication, running across 12 fairly large boxes with 16G RAM. All on SSDs. Striped across 3 AZs in AWS (hi1.4xlarges, fwiw). Has anyone had success getting good results for this kind of workload? Or, is Cassandra just not suited for it at all and I should just use an in-memory store? -Allan
RE: Commit logs building up
Nate, What values for the FlushWriter line would draw concern to you? What is the difference between Blocked and All Time Blocked? Parag

From: Nate McCall [mailto:n...@thelastpickle.com] Sent: Thursday, February 27, 2014 4:22 PM To: Cassandra Users Subject: Re: Commit logs building up

What was the impetus for turning up the commitlog_segment_size_in_mb? Also, in nodetool tpstats, what are the values for the FlushWriter line?

On Wed, Feb 26, 2014 at 12:18 PM, Christopher Wirt chris.w...@struq.com wrote: We're running 2.0.5, recently upgraded from 1.2.14. Sometimes we are seeing CommitLogs starting to build up. Is this a potential bug? Or a symptom of something else we can easily address? We have:

commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
commitlog_segment_size_in_mb: 512

Thanks, Chris -- - Nate McCall Austin, TX @zznate Co-Founder Sr. Technical Consultant Apache Cassandra Consulting http://www.thelastpickle.com
Commitlog questions
1) Why is the default 4GB? Has anyone changed this? What are some aspects to consider when determining the commitlog size?

2) If the commitlog is in periodic mode, there is a property to set a time interval to flush the incoming mutations to disk. This implies that there is a queue inside Cassandra to hold this data in memory until it is flushed.
a. Is there a name for this queue?
b. Is there a limit for this queue?
c. Are there any tuning parameters for this queue?

Thanks, Parag
[no subject]
Hi all, I'm getting the following error in a 2.0.6 instance:

ERROR [Native-Transport-Requests:16633] 2014-04-09 10:11:45,811 ErrorMessage.java (line 222) Unexpected exception during request
java.lang.AssertionError: localhost/127.0.0.1
at org.apache.cassandra.service.StorageProxy.submitHint(StorageProxy.java:860)
at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:480)
at org.apache.cassandra.service.StorageProxy.mutateWithTriggers(StorageProxy.java:524)
at org.apache.cassandra.cql3.statements.BatchStatement.executeWithoutConditions(BatchStatement.java:210)
at org.apache.cassandra.cql3.statements.BatchStatement.execute(BatchStatement.java:203)
at org.apache.cassandra.cql3.statements.BatchStatement.executeWithPerStatementVariables(BatchStatement.java:192)
at org.apache.cassandra.cql3.QueryProcessor.processBatch(QueryProcessor.java:373)
at org.apache.cassandra.transport.messages.BatchMessage.execute(BatchMessage.java:206)
at org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:304)
at org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43)
at org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Looking at the source for this, it appears to be related to a timeout:

// local write that time out should be handled by LocalMutationRunnable
assert !target.equals(FBUtilities.getBroadcastAddress()) : target;

Cursory testing indicates that this occurs during larger batch ingests. But the error does not appear to be propagated properly back to the client, and it seems like this could be due to some misconfiguration. Has anybody seen something like this before? Cheers, Ben
nodetool repair loops version 2.0.6
Have a test cluster with three nodes each in two datacenters. The following causes nodetool repair to go into an (apparent) infinite loop. This is with 2.0.6.

On node 10.140.140.101:

cqlsh> CREATE KEYSPACE looptest WITH replication = {
   ... 'class': 'NetworkTopologyStrategy',
   ... '140': '2',
   ... '141': '2'
   ... };
cqlsh> use looptest;
cqlsh:looptest> CREATE TABLE a_table (
            ... id uuid,
            ... description text,
            ... PRIMARY KEY (id)
            ... );

On node 10.140.140.102:

[default@unknown] describe cluster;
Cluster Information:
   Name: Dev Cluster
   Snitch: org.apache.cassandra.locator.RackInferringSnitch
   Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
   Schema versions:
      e7c46d59-fceb-38b5-947c-dcbd14950a4c: [10.141.140.101, 10.140.140.101, 10.140.140.102, 10.141.140.103, 10.141.140.102, 10.140.140.103]

nodetool status:

Datacenter: 141
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load      Tokens  Owns   Host ID                               Rack
UN  10.141.140.101  25.09 MB  256     15.6%  3f0d60bf-dfcd-42a9-9cff-8b76146359e3  140
UN  10.141.140.102  27.83 MB  256     16.7%  bbdcc640-278e-4d3d-ac12-fcb4d837d0e1  140
UN  10.141.140.103  23.78 MB  256     16.5%  b030e290-b8da-4883-a13d-b2529fab37fe  140
Datacenter: 140
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load      Tokens  Owns   Host ID                               Rack
UN  10.140.140.103  65.26 MB  256     18.1%  52a9a718-2bed-4972-ab11-bd97a8d8539c  140
UN  10.140.140.101  69.46 MB  256     17.6%  d59300db-6179-484e-9ca1-8d1eada0701a  140
UN  10.140.140.102  68.08 MB  256     15.4%  22e504c9-1cc6-4744-b302-32bb5116d409  140

Back on 10.140.140.101, nodetool repair looptest never returns. Looking in the system.log, it is continuously looping with:

INFO [AntiEntropySessions:818] 2014-04-09 13:23:31,889 RepairSession.java (line 282) [repair #24b2b1b0-bfea-11e3-85a3-911072ba5322] session completed successfully
INFO [AntiEntropySessions:816] 2014-04-09 13:23:31,916 RepairSession.java (line 244) [repair #253687b0-bfea-11e3-85a3-911072ba5322] new session: will sync /10.140.140.101, /10.141.140.103, /10.140.140.103, /10.141.140.102 on range (-4377479664111251829,-4360027703686042340] for looptest.[a_table]
INFO [AntiEntropyStage:1] 2014-04-09 13:23:31,949 RepairSession.java (line 164) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Received merkle tree for a_table from /10.141.140.102
INFO [RepairJobTask:3] 2014-04-09 13:23:32,002 RepairJob.java (line 134) [repair #253687b0-bfea-11e3-85a3-911072ba5322] requesting merkle trees for a_table (to [/10.141.140.103, /10.140.140.103, /10.141.140.102, /10.140.140.101])
INFO [AntiEntropyStage:1] 2014-04-09 13:23:32,007 RepairSession.java (line 164) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Received merkle tree for a_table from /10.140.140.101
INFO [RepairJobTask:3] 2014-04-09 13:23:32,012 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.141.140.101 and /10.140.140.103 are consistent for a_table
INFO [RepairJobTask:2] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.141.140.101 and /10.140.140.101 are consistent for a_table
INFO [RepairJobTask:1] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.141.140.101 and /10.141.140.102 are consistent for a_table
INFO [RepairJobTask:4] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.140.140.103 and /10.141.140.102 are consistent for a_table
INFO [RepairJobTask:5] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.140.140.103 and /10.140.140.101 are consistent for a_table
INFO [RepairJobTask:6] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.141.140.102 and /10.140.140.101 are consistent for a_table
INFO [AntiEntropyStage:1] 2014-04-09 13:23:32,018 RepairSession.java (line 221) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] a_table is fully synced
INFO [AntiEntropySessions:817] 2014-04-09 13:23:32,019 RepairSession.java (line 282) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] session completed successfully
INFO [AntiEntropySessions:818] 2014-04-09 13:23:32,043 RepairSession.java (line 244) [repair #2549c190-bfea-11e3-85a3-911072ba5322] new session: will sync /10.140.140.101, /10.141.140.103, /10.140.140.102, /10.141.140.102 on range (-3457228189350977014,-3443426249422196914] for looptest.[a_table]
INFO [RepairJobTask:3] 2014-04-09 13:23:32,169 RepairJob.java (line 134) [repair #2549c190-bfea-11e3-85a3-911072ba5322] requesting merkle trees for a_table (to [/10.141.140.103, /10.140.140.102, /10.141.140.102,
Re: Apache cassandra not joining cluster ring
Hello All, Kindly help with the below issues, I'm really stuck here. Thanks, Joy

On 8 April 2014 21:55, Joyabrata Das joy.luv.challen...@gmail.com wrote: Hello, I've a four node Apache Cassandra community 1.2 cluster in a single datacenter with one seed. All configurations are similar in the cassandra.yaml file. The following issues are faced, please help.

1] Though the fourth node isn't listed in the nodetool ring or status output, system.log shows that only this node isn't communicating via the gossip protocol with the other nodes. However, both the JMX and telnet ports are enabled, with the listen/seed addresses properly configured.

2] Though OpsCenter is able to recognize all four nodes, the agents are not getting installed from OpsCenter. However, the same JVM version is installed and JAVA_HOME is set on all four nodes.

Further, I observed that the problematic node runs 64-bit Ubuntu while the other nodes run 32-bit Ubuntu; can that be the reason?

Thanks, Joy
Re: nodetool repair loops version 2.0.6
In fact, it did eventually finish in ~20 minutes. Is this duration expected/normal? --Kevin

On Wed, Apr 9, 2014 at 9:32 AM, Kevin McLaughlin kmcla...@gmail.com wrote:

Have a test cluster with three nodes each in two datacenters. The following causes nodetool repair to go into an (apparent) infinite loop. This is with 2.0.6.

On node 10.140.140.101:

cqlsh> CREATE KEYSPACE looptest WITH replication = {
   ... 'class': 'NetworkTopologyStrategy',
   ... '140': '2',
   ... '141': '2'
   ... };
cqlsh> use looptest;
cqlsh:looptest> CREATE TABLE a_table (
            ... id uuid,
            ... description text,
            ... PRIMARY KEY (id)
            ... );

On node 10.140.140.102:

[default@unknown] describe cluster;
Cluster Information:
   Name: Dev Cluster
   Snitch: org.apache.cassandra.locator.RackInferringSnitch
   Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
   Schema versions:
      e7c46d59-fceb-38b5-947c-dcbd14950a4c: [10.141.140.101, 10.140.140.101, 10.140.140.102, 10.141.140.103, 10.141.140.102, 10.140.140.103]

nodetool status:

Datacenter: 141
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load      Tokens  Owns   Host ID                               Rack
UN  10.141.140.101  25.09 MB  256     15.6%  3f0d60bf-dfcd-42a9-9cff-8b76146359e3  140
UN  10.141.140.102  27.83 MB  256     16.7%  bbdcc640-278e-4d3d-ac12-fcb4d837d0e1  140
UN  10.141.140.103  23.78 MB  256     16.5%  b030e290-b8da-4883-a13d-b2529fab37fe  140
Datacenter: 140
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load      Tokens  Owns   Host ID                               Rack
UN  10.140.140.103  65.26 MB  256     18.1%  52a9a718-2bed-4972-ab11-bd97a8d8539c  140
UN  10.140.140.101  69.46 MB  256     17.6%  d59300db-6179-484e-9ca1-8d1eada0701a  140
UN  10.140.140.102  68.08 MB  256     15.4%  22e504c9-1cc6-4744-b302-32bb5116d409  140

Back on 10.140.140.101, nodetool repair looptest never returns. Looking in the system.log, it is continuously looping with:

INFO [AntiEntropySessions:818] 2014-04-09 13:23:31,889 RepairSession.java (line 282) [repair #24b2b1b0-bfea-11e3-85a3-911072ba5322] session completed successfully
INFO [AntiEntropySessions:816] 2014-04-09 13:23:31,916 RepairSession.java (line 244) [repair #253687b0-bfea-11e3-85a3-911072ba5322] new session: will sync /10.140.140.101, /10.141.140.103, /10.140.140.103, /10.141.140.102 on range (-4377479664111251829,-4360027703686042340] for looptest.[a_table]
INFO [AntiEntropyStage:1] 2014-04-09 13:23:31,949 RepairSession.java (line 164) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Received merkle tree for a_table from /10.141.140.102
INFO [RepairJobTask:3] 2014-04-09 13:23:32,002 RepairJob.java (line 134) [repair #253687b0-bfea-11e3-85a3-911072ba5322] requesting merkle trees for a_table (to [/10.141.140.103, /10.140.140.103, /10.141.140.102, /10.140.140.101])
INFO [AntiEntropyStage:1] 2014-04-09 13:23:32,007 RepairSession.java (line 164) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Received merkle tree for a_table from /10.140.140.101
INFO [RepairJobTask:3] 2014-04-09 13:23:32,012 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.141.140.101 and /10.140.140.103 are consistent for a_table
INFO [RepairJobTask:2] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.141.140.101 and /10.140.140.101 are consistent for a_table
INFO [RepairJobTask:1] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.141.140.101 and /10.141.140.102 are consistent for a_table
INFO [RepairJobTask:4] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.140.140.103 and /10.141.140.102 are consistent for a_table
INFO [RepairJobTask:5] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.140.140.103 and /10.140.140.101 are consistent for a_table
INFO [RepairJobTask:6] 2014-04-09 13:23:32,016 Differencer.java (line 67) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] Endpoints /10.141.140.102 and /10.140.140.101 are consistent for a_table
INFO [AntiEntropyStage:1] 2014-04-09 13:23:32,018 RepairSession.java (line 221) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] a_table is fully synced
INFO [AntiEntropySessions:817] 2014-04-09 13:23:32,019 RepairSession.java (line 282) [repair #24e867b0-bfea-11e3-85a3-911072ba5322] session completed successfully
INFO [AntiEntropySessions:818] 2014-04-09 13:23:32,043 RepairSession.java (line 244) [repair #2549c190-bfea-11e3-85a3-911072ba5322] new session: will sync /10.140.140.101, /10.141.140.103, /10.140.140.102, /10.141.140.102 on range
Re: Apache cassandra not joining cluster ring
Hello, The nodetool status that you mentioned, was that executed on the 4th node itself? Also, what does netstat display? Are the correct ports listening on that node? Per OpsCenter: what version of OpsCenter are you using? Are you able to manually start the agents on the nodes themselves? On Apr 9, 2014, at 6:57 AM, Joyabrata Das joy.luv.challen...@gmail.com wrote: Hello All, Kindly help with the below issues, I'm really stuck here. Thanks, Joy On 8 April 2014 21:55, Joyabrata Das joy.luv.challen...@gmail.com wrote: Hello, I've a four node Apache Cassandra community 1.2 cluster in a single datacenter with one seed. All configurations are similar in the cassandra.yaml file. The following issues are faced, please help. 1] Though the fourth node isn't listed in the nodetool ring or status output, system.log shows that only this node isn't communicating via the gossip protocol with the other nodes. However, both the JMX and telnet ports are enabled, with the listen/seed addresses properly configured. 2] Though OpsCenter is able to recognize all four nodes, the agents are not getting installed from OpsCenter. However, the same JVM version is installed and JAVA_HOME is set on all four nodes. Further, I observed that the problematic node runs 64-bit Ubuntu while the other nodes run 32-bit Ubuntu; can that be the reason? Thanks, Joy
Re: Apache cassandra not joining cluster ring
As Jonathan also asked for some various details, perhaps it would be helpful to be very specific about who, what, when, where, why, what you tried, actual errors, versions, pastebins of configs, etc. Provide the things that might be needed for people to help you out. For instance, the statement that "All configurations are similar in cassandra.yaml" means nothing to anyone on the list if they can't see them to tell you, "oh, here on line XX, you have blah, and it should be blarg." -- Kind regards, Michael On 04/09/2014 08:56 AM, Joyabrata Das wrote: Hello All, Kindly help with the below issues, I'm really stuck here. Thanks, Joy On 8 April 2014 21:55, Joyabrata Das joy.luv.challen...@gmail.com wrote: Hello, I've a four node Apache Cassandra community 1.2 cluster in a single datacenter with one seed. All configurations are similar in the cassandra.yaml file. The following issues are faced, please help. 1] Though the fourth node isn't listed in the nodetool ring or status output, system.log shows that only this node isn't communicating via the gossip protocol with the other nodes. However, both the JMX and telnet ports are enabled, with the listen/seed addresses properly configured. 2] Though OpsCenter is able to recognize all four nodes, the agents are not getting installed from OpsCenter. However, the same JVM version is installed and JAVA_HOME is set on all four nodes. Further, I observed that the problematic node runs 64-bit Ubuntu while the other nodes run 32-bit Ubuntu; can that be the reason? Thanks, Joy
Re: Commitlog questions
Parag: To answer your questions: 1) Default is just that, a default. I wouldn't advise raising it though. The bigger it is, the longer it takes to restart the node. 2) I think they just use fsync. There is no queue. All files in Cassandra use java.nio buffers, but they need to be fsynced periodically. Look at the commitlog_sync parameters in the cassandra.yaml file; the comments there explain how it works. I believe the difference between periodic and batch is just that -- if it is periodic, it will fsync every 10 seconds; if it is batch, it will fsync within the batch window whenever there are changes. On 2014-04-09 10:06:52 +, Parag Patel said: 1) Why is the default 4GB? Has anyone changed this? What are some aspects to consider when determining the commitlog size? 2) If the commitlog is in periodic mode, there is a property to set a time interval to flush the incoming mutations to disk. This implies that there is a queue inside Cassandra to hold this data in memory until it is flushed. a. Is there a name for this queue? b. Is there a limit for this queue? c. Are there any tuning parameters for this queue? Thanks, Parag -- Regards, Oleg Dulin http://www.olegdulin.com
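For reference, the knobs being discussed live in cassandra.yaml. The values below are the usual 2.0-era defaults, shown only as an illustration; check the file shipped with your own version, since defaults change between releases. The "4GB" in Parag's first question is presumably commitlog_total_space_in_mb, which defaults to 4096 on a 64-bit JVM:

commitlog_sync: periodic                 # or "batch"
commitlog_sync_period_in_ms: 10000       # periodic mode: fsync roughly every 10 seconds
# commitlog_sync_batch_window_in_ms: 50  # batch mode: max wait before fsync; writes are only acked after the fsync
commitlog_segment_size_in_mb: 32         # size of each individual commitlog segment file
# commitlog_total_space_in_mb: 4096      # past this total, Cassandra flushes the dirty CFs in the oldest segment and recycles it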
Re: Apache cassandra not joining cluster ring
On 04/08/2014 11:25 AM, Joyabrata Das wrote: Further, I observed that the problematic node runs 64-bit Ubuntu while the other nodes run 32-bit Ubuntu; can that be the reason? This may not be recommended, might/should(?) work, and may be a reason [0]. My first suggestion would be to remove this variable. This would also give you a chance to go through the steps of adding a new node to the cluster again [1] - you might stumble on something. [0] http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/can-I-have-a-mix-of-32-and-64-bit-machines-in-a-cluster-td7583051.html [1] http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html -- Michael
Re: Multiget performance
As one CQL statement: SELECT * from Event WHERE key IN ([100 keys]); -Allan On April 9, 2014 at 12:52:13 AM, Daniel Chia (danc...@coursera.org) wrote: Are you making the 100 calls in serial, or in parallel? Thanks, Daniel On Tue, Apr 8, 2014 at 11:22 PM, Allan C alla...@gmail.com wrote: Hi all, I’ve always been told that multigets are a Cassandra anti-pattern for performance reasons. I ran a quick test tonight to prove it to myself, and, sure enough, slowness ensued. It takes about 150ms to get 100 keys for my use case. Not terrible, but at least an order of magnitude from what I need it to be. So far, I’ve been able to denormalize and not have any problems. Today, I ran into a use case where denormalization introduces a huge amount of complexity to the code. It’s very tempting to cache a subset in Redis and call it a day — probably will. But, that’s not a very satisfying answer. It’s only about 5GB of data and it feels like I should be able to tune a Cassandra CF to be within 2x. The workload is around 70% reads. Most of the writes are updates to existing data. Currently, it’s in an LCS CF with ~30M rows. The cluster is 300GB total with 3-way replication, running across 12 fairly large boxes with 16G RAM. All on SSDs. Striped across 3 AZs in AWS (hi1.4xlarges, fwiw). Has anyone had success getting good results for this kind of workload? Or, is Cassandra just not suited for it at all and I should just use an in-memory store? -Allan
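The IN form funnels all 100 partitions through a single coordinator, which has to contact replicas for every key and buffer the whole result. The parallel alternative Daniel is asking about issues one single-partition query per key and waits on the futures; with a token-aware load balancing policy each query can then go straight to a replica. A minimal sketch, assuming the DataStax Java driver 2.0 and the Event/key names from Allan's statement:

import com.datastax.driver.core.*;
import com.google.common.util.concurrent.Futures;
import java.util.ArrayList;
import java.util.List;

public class ParallelGet {
    public static List<Row> fetchAll(Session session, List<String> keys) throws Exception {
        // Prepare once, bind per key.
        PreparedStatement ps = session.prepare("SELECT * FROM Event WHERE key = ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (String key : keys) {
            // executeAsync is non-blocking; the driver pipelines all 100 requests.
            futures.add(session.executeAsync(ps.bind(key)));
        }
        List<Row> rows = new ArrayList<>();
        // Block until every per-key query has completed, then collect the rows.
        for (ResultSet rs : Futures.allAsList(futures).get()) {
            rows.addAll(rs.all());
        }
        return rows;
    }
}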
Re: binary protocol server side sockets
Thanks, but I would think that just sets keep alive from the client end; I’m talking about the server end… this is one of those issues where there is something (e.g. switch, firewall, VPN in between the client and the server) and we get left with orphaned established connections to the server when the client is gone. On Apr 9, 2014, at 2:48 AM, DuyHai Doan doanduy...@gmail.com wrote: Hello Graham You can use the following code with the official Java driver: SocketOptions socketOptions = new SocketOptions(); socketOptions.setKeepAlive(true); Cluster.builder().addContactPoints(contactPointsList) .withPort(cql3Port) .withCompression(ProtocolOptions.Compression.SNAPPY) .withCredentials(cassandraUsername, cassandraPassword) .withSocketOptions(socketOptions) .build(); or : alreadyBuiltClusterInstance.getConfiguration().getSocketOptions().setKeepAlive(true); Although I'm not sure if the second alternative does work because the cluster is already built and maybe the connection is already established... Regards Duy Hai DOAN On Wed, Apr 9, 2014 at 12:59 AM, graham sanderson gra...@vast.com wrote: Is there a way to configure KEEPALIVE on the server end sockets of the binary protocol. rpc_keepalive only affects thrift. This is on 2.0.5 Thanks, Graham
Re: binary protocol server side sockets
On 04/09/2014 11:39 AM, graham sanderson wrote: Thanks, but I would think that just sets keep alive from the client end; I’m talking about the server end… this is one of those issues where there is something (e.g. switch, firewall, VPN in between the client and the server) and we get left with orphaned established connections to the server when the client is gone. There would be no server setting for any service, not just c*, that would correct mis-configured connection-assassinating network gear between the client and server. Fix the gear to allow persistent connections. Digging through the various timeouts in c*.yaml didn't lead me to a simple answer for something tunable, but I think this may be more basic networking related. I believe it's up to the client to keep the connection open as Duy indicated. I don't think c* will arbitrarily sever connections - something that disconnects the client may happen. In that case, the TCP connection on the server should drop to TIME_WAIT. Is this what you are seeing in `netstat -a` on the server - a bunch of TIME_WAIT connections hanging around? Those should eventually be recycled, but that's tunable in the network stack, if they are being generated at a high rate. -- Michael
Re: binary protocol server side sockets
Michael, it is not that the connections are being dropped, it is that the connections are not being dropped. These server side sockets are ESTABLISHED, even though the client connection on the other side of the network device is long gone. This may well be an issue with the network device (it is valiantly trying to keep the connection alive it seems). That said KEEPALIVE on the server side would not be a bad idea. At least then the OS on the server would eventually (probably after 2 hours of inactivity) attempt to ping the client. At that point hopefully something interesting would happen perhaps causing an error and destroying the server side socket (note KEEPALIVE is also good for preventing idle connections from being dropped by other network devices along the way) rpc_keepalive on the server sets keep alive on the server side sockets for thrift, and is true by default There doesn’t seem to be a setting for the native protocol Note this isn’t a huge issue for us, they can be cleaned up by a rolling restart, and this particular case is not production, but related to development/testing against alpha by people working remotely over VPN - and it may well be the VPNs fault in this case… that said and maybe this is a dev list question, it seems like the option to set keepalive should exist. On Apr 9, 2014, at 12:25 PM, Michael Shuler mich...@pbandjelly.org wrote: On 04/09/2014 11:39 AM, graham sanderson wrote: Thanks, but I would think that just sets keep alive from the client end; I’m talking about the server end… this is one of those issues where there is something (e.g. switch, firewall, VPN in between the client and the server) and we get left with orphaned established connections to the server when the client is gone. There would be no server setting for any service, not just c*, that would correct mis-configured connection-assassinating network gear between the client and server. Fix the gear to allow persistent connections. Digging through the various timeouts in c*.yaml didn't lead me to a simple answer for something tunable, but I think this may be more basic networking related. I believe it's up to the client to keep the connection open as Duy indicated. I don't think c* will arbitrarily sever connections - something that disconnects the client may happen. In that case, the TCP connection on the server should drop to TIME_WAIT. Is this what you are seeing in `netstat -a` on the server - a bunch of TIME_WAIT connections hanging around? Those should eventually be recycled, but that's tunable in the network stack, if they are being generated at a high rate. -- Michael
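For what it's worth, the native-protocol server in 2.0 is built on Netty 3, where SO_KEEPALIVE on accepted sockets is a one-line "child" option on the server bootstrap. A hypothetical sketch of the idiom such a patch would use (this is not the actual Cassandra code):

import java.net.InetSocketAddress;
import java.util.concurrent.Executors;
import org.jboss.netty.bootstrap.ServerBootstrap;
import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;

public class KeepAliveServerSketch {
    public static void main(String[] args) {
        ServerBootstrap bootstrap = new ServerBootstrap(
                new NioServerSocketChannelFactory(
                        Executors.newCachedThreadPool(),
                        Executors.newCachedThreadPool()));
        // "child.*" options apply to each accepted per-client socket.
        bootstrap.setOption("child.tcpNoDelay", true);
        bootstrap.setOption("child.keepAlive", true); // SO_KEEPALIVE on the server-side sockets
        // A real server would also install a ChannelPipelineFactory here.
        bootstrap.bind(new InetSocketAddress(9042));
    }
}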
Re: Commitlog questions
On Wed, Apr 9, 2014 at 3:06 AM, Parag Patel ppa...@clearpoolgroup.com wrote: some questions about the commitlog and related assumptions https://issues.apache.org/jira/browse/CASSANDRA-6764 You might wish to get in contact with the reporter here, who has similar questions! =Rob
Re: Commit logs building up
On Wed, Apr 9, 2014 at 3:06 AM, Parag Patel ppa...@clearpoolgroup.com wrote: What values for the FlushWriter line would draw concern to you? What is the difference between Blocked and All Time Blocked? Non-zero all time blocked. Because if the FlushWriter is blocked, you probably don't have enough io to flush quickly enough. Blocked is currently blocked, all time blocked is blocked since node startup. =Rob
Re: nodetool repair loops version 2.0.6
On Wed, Apr 9, 2014 at 7:09 AM, Kevin McLaughlin kmcla...@gmail.com wrote: In fact, it did eventually finish in ~20 minutes. Is this duration expected/normal? https://issues.apache.org/jira/browse/CASSANDRA-5220 =Rob
Re: binary protocol server side sockets
On 04/09/2014 12:41 PM, graham sanderson wrote: Michael, it is not that the connections are being dropped, it is that the connections are not being dropped. Thanks for the clarification. These server side sockets are ESTABLISHED, even though the client connection on the other side of the network device is long gone. This may well be an issue with the network device (it is valiantly trying to keep the connection alive it seems). Have you tested if they *ever* time out on their own, or do they just keep sticking around forever? (maybe 432000 sec (120 hours), which is the default for nf_conntrack_tcp_timeout_established?) Trying out all the usage scenarios is really the way to track it down - directly on switch, behind/in front of firewall, on/off the VPN. That said KEEPALIVE on the server side would not be a bad idea. At least then the OS on the server would eventually (probably after 2 hours of inactivity) attempt to ping the client. At that point hopefully something interesting would happen perhaps causing an error and destroying the server side socket (note KEEPALIVE is also good for preventing idle connections from being dropped by other network devices along the way) Tuning net.ipv4.tcp_keepalive_* could be helpful, if you know they timeout after 2 hours, which is the default. rpc_keepalive on the server sets keep alive on the server side sockets for thrift, and is true by default There doesn’t seem to be a setting for the native protocol Note this isn’t a huge issue for us, they can be cleaned up by a rolling restart, and this particular case is not production, but related to development/testing against alpha by people working remotely over VPN - and it may well be the VPNs fault in this case… that said and maybe this is a dev list question, it seems like the option to set keepalive should exist. Yeah, but I agree you shouldn't have to restart to clean up connections - that's why I think it is lower in the network stack, and that a bit of troubleshooting and tuning might be helpful. That setting sounds like a good Jira request - keepalive may be the default, I'm not sure. :) -- Michael On Apr 9, 2014, at 12:25 PM, Michael Shuler mich...@pbandjelly.org wrote: On 04/09/2014 11:39 AM, graham sanderson wrote: Thanks, but I would think that just sets keep alive from the client end; I’m talking about the server end… this is one of those issues where there is something (e.g. switch, firewall, VPN in between the client and the server) and we get left with orphaned established connections to the server when the client is gone. There would be no server setting for any service, not just c*, that would correct mis-configured connection-assassinating network gear between the client and server. Fix the gear to allow persistent connections. Digging through the various timeouts in c*.yaml didn't lead me to a simple answer for something tunable, but I think this may be more basic networking related. I believe it's up to the client to keep the connection open as Duy indicated. I don't think c* will arbitrarily sever connections - something that disconnects the client may happen. In that case, the TCP connection on the server should drop to TIME_WAIT. Is this what you are seeing in `netstat -a` on the server - a bunch of TIME_WAIT connections hanging around? Those should eventually be recycled, but that's tunable in the network stack, if they are being generated at a high rate. -- Michael
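The kernel timers Michael mentions, with their usual Linux defaults; note they only apply to sockets that actually have SO_KEEPALIVE set, which is exactly the piece missing for the native protocol:

# /etc/sysctl.conf
net.ipv4.tcp_keepalive_time = 7200   # idle seconds before the first keepalive probe (the "2 hours")
net.ipv4.tcp_keepalive_intvl = 75    # seconds between unanswered probes
net.ipv4.tcp_keepalive_probes = 9    # unanswered probes before the kernel drops the connection

Lowering tcp_keepalive_time would get orphaned-but-ESTABLISHED connections probed, and torn down, much sooner than two hours.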
Update SSTable fragmentation
I've been doing a lot of reading on SSTable fragmentation due to updates and the costs associated with reconstructing the end data from multiple SSTables that have been created over time and not yet compacted. One question is stuck in my head: If you re-insert entire rows instead of updating one column, will Cassandra end up flushing that entire row into one SSTable on disk, and then end up finding a non-fragmented entire row quickly on reads instead of potentially reconstructing across multiple SSTables? Obviously this has implications for space as a trade-off. Wayne
Per-keyspace partitioners?
Hi everyone, Is there a way to change the partitioner on a per-table or per-keyspace basis? We have some tables for which we'd like to enable ordered scans of rows, so we'd like to use the ByteOrdered partitioner for those, but use Murmur3 for everything else in our cluster. Is this possible? Or does the partitioner have to be the same for the entire cluster? Best regards, Clint
Re: Per-keyspace partitioners?
Hello, Partitioner is per cluster. We have seen users create separate clusters for items like this, but that's an edge case. Jonathan Jonathan Lacefield Solutions Architect, DataStax (404) 822 3487 http://www.linkedin.com/in/jlacefield http://www.datastax.com/cassandrasummit14 On Wed, Apr 9, 2014 at 11:57 AM, Clint Kelly clint.ke...@gmail.com wrote: Hi everyone, Is there a way to change the partitioner on a per-table or per-keyspace basis? We have some tables for which we'd like to enable ordered scans of rows, so we'd like to use the ByteOrdered partitioner for those, but use Murmur3 for everything else in our cluster. Is this possible? Or does the partitioner have to be the same for the entire cluster? Best regards, Clint
Re: binary protocol server side sockets
Thanks Michael, Yup keepalive is not the default. It is possible they are going away after nf_conntrack_tcp_timeout_established; will have to do more digging (it is hard to tell how old a connection is - there are no visible timers (thru netstat) on an ESTABLISHED connection)… This is actually low on my priority list, I was just spending a bit of time trying to track down the source of

ERROR [Native-Transport-Requests:3833603] 2014-04-09 17:46:48,833 ErrorMessage.java (line 222) Unexpected exception during request
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

errors, which are spamming our server logs quite a lot (I originally thought this might be caused by KEEPALIVE, which is when I realized that the connections weren’t in keep alive and were building up) - it would be nice if netty would tell us a little about the Socket channel in the error message (maybe there is a way to do this by changing log levels, but as I say I haven’t had time to go digging there). I will probably file a JIRA issue to add the setting (since I can’t see any particular harm to setting keepalive) On Apr 9, 2014, at 1:34 PM, Michael Shuler mich...@pbandjelly.org wrote: On 04/09/2014 12:41 PM, graham sanderson wrote: Michael, it is not that the connections are being dropped, it is that the connections are not being dropped. Thanks for the clarification. These server side sockets are ESTABLISHED, even though the client connection on the other side of the network device is long gone. This may well be an issue with the network device (it is valiantly trying to keep the connection alive it seems). Have you tested if they *ever* time out on their own, or do they just keep sticking around forever? (maybe 432000 sec (120 hours), which is the default for nf_conntrack_tcp_timeout_established?) Trying out all the usage scenarios is really the way to track it down - directly on switch, behind/in front of firewall, on/off the VPN. That said KEEPALIVE on the server side would not be a bad idea. At least then the OS on the server would eventually (probably after 2 hours of inactivity) attempt to ping the client. At that point hopefully something interesting would happen perhaps causing an error and destroying the server side socket (note KEEPALIVE is also good for preventing idle connections from being dropped by other network devices along the way) Tuning net.ipv4.tcp_keepalive_* could be helpful, if you know they timeout after 2 hours, which is the default.
rpc_keepalive on the server sets keep alive on the server side sockets for thrift, and is true by default There doesn’t seem to be a setting for the native protocol Note this isn’t a huge issue for us, they can be cleaned up by a rolling restart, and this particular case is not production, but related to development/testing against alpha by people working remotely over VPN - and it may well be the VPNs fault in this case… that said and maybe this is a dev list question, it seems like the option to set keepalive should exist. Yeah, but I agree you shouldn't have to restart to clean up connections - that's why I think it is lower in the network stack, and that a bit of troubleshooting and tuning might be helpful. That setting sounds like a good Jira request - keepalive may be the default, I'm not sure. :) -- Michael On Apr 9, 2014, at 12:25 PM, Michael Shuler mich...@pbandjelly.org wrote: On 04/09/2014 11:39 AM, graham sanderson wrote: Thanks, but I would think that just sets keep alive from the client end; I’m talking about the server end… this is one of those issues where there is something (e.g. switch, firewall, VPN in between the client and the server) and we get left with orphaned established connections to the server when the client is gone. There would be no server setting for any service, not just c*, that would correct mis-configured connection-assassinating network gear between
Re: Update SSTable fragmentation
I don't believe so. Cassandra still needs to hit the bloom filters for each SSTable and then reconcile all versions and all tombstones for any row. That's why overwrites have a similar performance impact to tombstones; overwrites just happen to be less common. On Wed, Apr 9, 2014 at 2:42 PM, Wayne Schroeder wschroe...@pinsightmedia.com wrote: I've been doing a lot of reading on SSTable fragmentation due to updates and the costs associated with reconstructing the end data from multiple SSTables that have been created over time and not yet compacted. One question is stuck in my head: If you re-insert entire rows instead of updating one column, will Cassandra end up flushing that entire row into one SSTable on disk, and then end up finding a non-fragmented entire row quickly on reads instead of potentially reconstructing across multiple SSTables? Obviously this has implications for space as a trade-off. Wayne -- Ken Hancock | System Architect, Advanced Advertising SeaChange International 50 Nagog Park Acton, Massachusetts 01720 ken.hanc...@schange.com | www.schange.com | NASDAQ:SEAC Office: +1 (978) 889-3329 | Google Talk: ken.hanc...@schange.com | Skype: hancockks | Yahoo IM: hancockks | LinkedIn: http://www.linkedin.com/in/kenhancock This e-mail and any attachments may contain information which is SeaChange International confidential. The information enclosed is intended only for the addressees herein and may not be copied or forwarded without permission from SeaChange International.
Re: Multiget performance
Can you trace the query and paste the results? On Wed, Apr 9, 2014 at 11:17 AM, Allan C alla...@gmail.com wrote: As one CQL statement: SELECT * from Event WHERE key IN ([100 keys]); -Allan On April 9, 2014 at 12:52:13 AM, Daniel Chia (danc...@coursera.org) wrote: Are you making the 100 calls in serial, or in parallel? Thanks, Daniel On Tue, Apr 8, 2014 at 11:22 PM, Allan C alla...@gmail.com wrote: Hi all, I've always been told that multigets are a Cassandra anti-pattern for performance reasons. I ran a quick test tonight to prove it to myself, and, sure enough, slowness ensued. It takes about 150ms to get 100 keys for my use case. Not terrible, but at least an order of magnitude from what I need it to be. So far, I've been able to denormalize and not have any problems. Today, I ran into a use case where denormalization introduces a huge amount of complexity to the code. It's very tempting to cache a subset in Redis and call it a day -- probably will. But, that's not a very satisfying answer. It's only about 5GB of data and it feels like I should be able to tune a Cassandra CF to be within 2x. The workload is around 70% reads. Most of the writes are updates to existing data. Currently, it's in an LCS CF with ~30M rows. The cluster is 300GB total with 3-way replication, running across 12 fairly large boxes with 16G RAM. All on SSDs. Striped across 3 AZs in AWS (hi1.4xlarges, fwiw). Has anyone had success getting good results for this kind of workload? Or, is Cassandra just not suited for it at all and I should just use an in-memory store? -Allan -- Tyler Hobbs DataStax http://datastax.com/
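For anyone who wants the trace from application code rather than cqlsh's TRACING ON, the Java driver can request it per statement. A minimal sketch against the 2.0 driver API:

import com.datastax.driver.core.*;

public class TraceQuery {
    public static void printTrace(Session session, String cql) {
        // enableTracing() asks the coordinator to record a trace session for this statement.
        Statement stmt = new SimpleStatement(cql).enableTracing();
        ResultSet rs = session.execute(stmt);
        QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
        System.out.printf("coordinator %s, duration %d us%n",
                trace.getCoordinator(), trace.getDurationMicros());
        for (QueryTrace.Event e : trace.getEvents()) {
            // Each event records which host did what, and how far into the request it happened.
            System.out.printf("%8d us | %-15s | %s%n",
                    e.getSourceElapsedMicros(), e.getSource(), e.getDescription());
        }
    }
}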
Re: Upgrading Cassandra
On Tue, Apr 8, 2014 at 4:39 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Yet, can't we rebuild a new DC with the current C* version, upgrade it to the new major once it is fully part of the C* cluster, and then switch all the clients to the new DC once we are sure everything is ok and shut down the old one ?

Yes

I mean, on a multiDC setup, while upgrading, there must be a moment that 2 DCs haven't the same major version, this is probably supported.

It is supported, you just don't want to do add/remove nodes, run repairs, etc with a mixed cluster. -- Tyler Hobbs DataStax http://datastax.com/
How to replace cluster name without any impact?
We have a Cassandra cluster of around 36 nodes across three datacenters, 12 nodes in each datacenter. We already have data flowing into Cassandra and we cannot wipe out all our data now. Considering this, what is the right way to change the cluster name with no impact, or minimal impact?
Re: How to replace cluster name without any impact?
What version are you running? As of 1.2.x you can do the following:

1. Start cqlsh connected locally to the node.
2. Run: update system.local set cluster_name='$CLUSTER_NAME' where key='local';
3. Run nodetool flush on the node.
4. Update the cassandra.yaml file on the node, changing the cluster_name to the same as you set in step 2.
5. Restart the node.

Please be aware that you will have two partial clusters until you complete your rolling restart. Also, considering that the cluster name is only a cosmetic value, my opinion would be to leave it, as the risk far outweighs the benefits of changing it. Mark

On Thu, Apr 10, 2014 at 2:49 AM, Check Peck comptechge...@gmail.com wrote: We have a Cassandra cluster of around 36 nodes across three datacenters, 12 nodes in each datacenter. We already have data flowing into Cassandra and we cannot wipe out all our data now. Considering this, what is the right way to change the cluster name with no impact, or minimal impact?
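Spelled out as a per-node console session; 'NewCluster' is an illustrative name and the yaml path is a package-install default, so adjust both for your environment:

$ cqlsh
cqlsh> UPDATE system.local SET cluster_name = 'NewCluster' WHERE key = 'local';
cqlsh> exit
$ nodetool flush system                 # persist the system.local change before restarting
$ vi /etc/cassandra/cassandra.yaml      # set: cluster_name: 'NewCluster'
$ sudo service cassandra restart

Repeat on each node in turn; as Mark notes, the nodes will disagree on the name until the rolling restart completes.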