Re: 10000+ CF support from Cassandra
How big is each of the tables - are they all fairly small or fairly large? Small as in no more than thousands of rows, or large as in tens of millions or hundreds of millions of rows? Small tables are not ideal for a Cassandra cluster, since the rows would be spread out across the nodes, even though it might make more sense for each small table to be on a single node.

You might want to consider a model where you have an application layer that maps logical tenant tables into partition keys within a single large Cassandra table, or at least a relatively small number of Cassandra tables. It will depend on the typical size of your tenant tables - very small ones would make sense within a single partition, while larger ones should have separate partitions for a tenant's data. The key here is that tables are expensive, but partitions are cheap and scale very well with Cassandra.

Finally, you said 10 clusters, but did you mean 10 nodes? You might want to consider a model where you do indeed have multiple clusters, where each handles a fraction of the tenants, since there is no need for separate tenants to be on the same cluster.

-- Jack Krupansky

On Tue, May 26, 2015 at 11:32 PM, Arun Chaitanya chaitan64a...@gmail.com wrote:

Good Day Everyone,

I am very happy with the (almost) linear scalability offered by C*. We had a lot of problems with RDBMS. But I heard that C* has a limit on the number of column families that can be created in a single cluster, the reason being that each CF stores 1-2 MB on the JVM heap. In our use case, we have about 1+ CF and we want to support multi-tenancy. (i.e 1 * no of tenants) We are new to C* and, coming from an RDBMS background, I would like your advice on how to tackle this scenario. Our plan is to use the off-heap memtable approach:
http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1

Each node in the cluster has the following configuration: 16 GB machine (8GB Cassandra JVM + 2GB System + 6GB Off-Heap). IMO, this should be able to support 1000 CF with no (or very little) impact on performance and startup time. We tackle multi-tenancy using different keyspaces (a solution I found on the web). Using this approach we can have 10 clusters doing the job. (We are actually worried about the cost.)

Can you please help us evaluate this strategy? I want to hear the community's opinion. My major concerns are:
1. Is the off-heap strategy safe, and is my assumption of 16 GB supporting 1000 CF right?
2. Can we use multiple keyspaces to solve multi-tenancy? IMO, the number of column families increases even when we use multiple keyspaces.
3. I understand the complexity of using multiple clusters for a single application. The code base will get tightly coupled with infrastructure. Is this the right approach?

Any suggestion is appreciated.

Thanks,
Arun
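Jack's partition-key mapping suggestion can be sketched in a few lines. This is a hypothetical illustration: the shared table name (tenant_data), the pk/row_id/payload columns, and the helper names are all made up, and a real client would use prepared statements with bound parameters instead of string interpolation.

```python
# Map a logical per-tenant table onto a partition key in one shared table,
# so thousands of logical tables need only one Cassandra table.

def tenant_partition_key(tenant_id: str, logical_table: str) -> str:
    """Combine tenant and logical table name into a single partition key."""
    return "%s:%s" % (tenant_id, logical_table)

def build_insert_cql(tenant_id: str, logical_table: str,
                     row_id: str, payload_json: str) -> str:
    """Build a CQL INSERT against the shared table (illustrative only)."""
    pk = tenant_partition_key(tenant_id, logical_table)
    return ("INSERT INTO tenant_data (pk, row_id, payload) "
            "VALUES ('%s', '%s', '%s')" % (pk, row_id, payload_json))

stmt = build_insert_cql("acme", "orders", "42", '{"total": 10}')
print(stmt)
```

All rows of one small tenant table then share a partition, while a large tenant table could extend the partition key (e.g. with a bucket) to get several partitions.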
Re: Cassandra 1.2.x EOL date
On Wed, May 27, 2015 at 5:10 PM, Jason Unovitch jason.unovi...@gmail.com wrote:

Simple and quick question: can anyone point me to where the Cassandra 1.2.x series EOL date was announced? I see archived mailing list threads for 1.2.19 mentioning it was going to be the last release, and I see CVE-2015-0225 mention it is EOL. I didn't see it say when the official EOL date was.

When version n+2 is released, version n is EOL. Release dates for new versions are not generally known in advance. 2.1.5 has been released, so 1.2.19 (2.1, 2.0, 1.2 = 1.2 is current -2) is EOL for the 1.2 branch as of its release date. In some very rare cases, it is theoretically possible that the most recent version of an EOL branch might get a patch.

=Rob
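Rob's rule of thumb ("when version n+2 is released, version n is EOL") is easy to state in code; the branch list below is illustrative.

```python
# A branch is EOL once at least two newer branches have been released.
# 'branches' must be listed in release order, oldest first.

def is_eol(branch, branches):
    return branches.index(branch) <= len(branches) - 3

branches = ["1.2", "2.0", "2.1"]  # 2.1 is the newest released branch
print([b for b in branches if is_eol(b, branches)])  # → ['1.2']
```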
Re: Cassandra seems to replace existing node without specifying replace_address
On Thu, May 28, 2015 at 2:00 AM, Thomas Whiteway thomas.white...@metaswitch.com wrote: Sorry, I should have been clearer. In this case we’ve decommissioned the node and deleted the data, commitlog, and saved caches directories so we’re not hitting CASSANDRA-8801. We also hit the “A node with address address already exists, cancelling join” error when performing the same steps on 2.1.0, just not in 2.1.4. Is auto_bootstrap set to true or false? Are you using vnodes, or is initial_token populated? =Rob
Re: Spark SQL JDBC Server + DSE
Mohammed,

This doesn't really answer your question, but I'm working on a new REST server that allows people to submit SQL queries over REST, which get executed via Spark SQL. Based on what I started here:
http://brianoneill.blogspot.com/2015/05/spark-sql-against-cassandra-example.html

I assume you need JDBC connectivity specifically?

-brian

---
Brian O'Neill
Chief Technology Officer
Health Market Science, a LexisNexis Company
215.588.6024 Mobile
@boneill42 http://www.twitter.com/boneill42

From: Mohammed Guller moham...@glassbeam.com
Reply-To: user@cassandra.apache.org
Date: Thursday, May 28, 2015 at 8:26 PM
To: user@cassandra.apache.org
Subject: RE: Spark SQL JDBC Server + DSE

Anybody out there using DSE + Spark SQL JDBC server?

Mohammed

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Tuesday, May 26, 2015 6:17 PM
To: user@cassandra.apache.org
Subject: Spark SQL JDBC Server + DSE

Hi -

As I understand, the Spark SQL Thrift/JDBC server cannot be used with open source C*; only DSE supports the Spark SQL JDBC server. We would like to find out how many organizations are using this combination. If you do use DSE + Spark SQL JDBC server, it would be great if you could share your experience. For example, what kind of issues have you run into? How is the performance? What reporting tools are you using?

Thank you!
Mohammed
what this error mean
I have a 25-node C* cluster running C* 2.1.3. These days one node splits from the cluster (split brain) many times. Checking the log I found this:

INFO [MemtableFlushWriter:118] 2015-05-29 08:07:39,176 Memtable.java:378 - Completed flushing /home/ant/apache-cassandra-2.1.3/bin/../data/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-4371-Data.db (8187 bytes) for commitlog position ReplayPosition(segmentId=1432775133526, position=16684949)
ERROR [IndexSummaryManager:1] 2015-05-29 08:10:30,209 CassandraDaemon.java:167 - Exception in thread Thread[IndexSummaryManager:1,1,main]
java.lang.AssertionError: null
at org.apache.cassandra.io.sstable.SSTableReader.cloneWithNewSummarySamplingLevel(SSTableReader.java:921) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.io.sstable.IndexSummaryManager.adjustSamplingLevels(IndexSummaryManager.java:410) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries(IndexSummaryManager.java:288) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries(IndexSummaryManager.java:238) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.io.sstable.IndexSummaryManager$1.runMayThrow(IndexSummaryManager.java:139) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.1.3.jar:2.1.3]
at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:82) ~[apache-cassandra-2.1.3.jar:2.1.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_71]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [na:1.7.0_71]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_71]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]

I want to know why this happens and how to fix it.

Thanks all
--
Ranger Tsao
RE: Spark SQL JDBC Server + DSE
Anybody out there using DSE + Spark SQL JDBC server?

Mohammed

From: Mohammed Guller [mailto:moham...@glassbeam.com]
Sent: Tuesday, May 26, 2015 6:17 PM
To: user@cassandra.apache.org
Subject: Spark SQL JDBC Server + DSE

Hi -

As I understand, the Spark SQL Thrift/JDBC server cannot be used with open source C*; only DSE supports the Spark SQL JDBC server. We would like to find out how many organizations are using this combination. If you do use DSE + Spark SQL JDBC server, it would be great if you could share your experience. For example, what kind of issues have you run into? How is the performance? What reporting tools are you using?

Thank you!
Mohammed
Re: 10000+ CF support from Cassandra
Hello Jack,

> Column families? As opposed to tables? Are you using Thrift instead of CQL3? You should be focusing on the latter, not the former.

We have an ORM developed in our company, which maps each DTO to a column family. So we have many column families. We are using CQL3.

> But either way, the general guidance is that there is no absolute limit of tables per se, but low hundreds is the recommended limit, regardless of how many keyspaces they may be divided between. More than that is an anti-pattern for Cassandra - maybe you can make it work for your application, but it isn't recommended.

You want to say that most Cassandra users don't have more than 2-300 column families? Is this achieved through careful data modelling?

> A successful Cassandra deployment is critically dependent on careful data modeling - who is responsible for modeling each of these tables, you and a single, tightly-knit team with very common interests and very specific goals and SLAs, or many different developers with different managers with different goals and SLAs?

The latter.

> When you say multi-tenant, are you simply saying that each of your organization's customers has their data segregated, or does each customer have direct access to the cluster?

Each organization's data is in the same cluster. No customer has direct access to the cluster.

Thanks,
Arun

On Wed, May 27, 2015 at 7:17 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

Scalability of Cassandra refers primarily to number of rows and number of nodes - to add more data, add more nodes.

[remainder of quoted thread snipped - see the original messages earlier in this digest]
Re: Start with single node, move to 3-node cluster
hmm.. I suppose you start with rf = 1, and then when 3N arrives, just add the nodes into the cluster and later decommission this one node?
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_remove_node_t.html

hth
jason

On Tue, May 26, 2015 at 10:02 PM, Matthew Johnson matt.john...@algomi.com wrote:

Hi Jason,

When the 3N cluster is up and running, I need to get the data from SN into the 3N cluster and then give the SN server back. So I need to keep the data, but on completely new servers - just trying to work out what the best way of doing that is. The volume of data that needs migrating won't be huge, probably about 30G, but it is data that I definitely need to keep (for historical analysis, audit etc).

Thanks!
Matthew

From: Jason Wee [mailto:peich...@gmail.com]
Sent: 26 May 2015 14:38
To: user@cassandra.apache.org
Subject: Re: Start with single node, move to 3-node cluster

Will you add this lent node into the 3N to form a cluster? But really, if you are just getting started, you could use this one node for your learning by installing multiple instances on it for experiments or development purposes only. IMHO, in the long run, this will prove to be very valuable, at least it was for me. With this single node, you can easily simulate things like a C* upgrade. For instance, C* right now is at 2.1.5; when 2.2 goes stable, you can use your multiple instances on this single node to simulate your production environment safely.

hth
jason

On Tue, May 26, 2015 at 9:24 PM, Matthew Johnson matt.john...@algomi.com wrote:

Hi gurus,

We have ordered some hardware for a 3-node cluster, but its ETA is 6 to 8 weeks. In the meantime, I have been lent a single server that I can use. I am wondering what the best way is to set up my single node (SN), so I can then move to the 3-node cluster (3N) when the hardware arrives. Do I:

1. Create my keyspaces on SN with RF=1, and when 3N is up and running migrate all the data manually (either through Spark, dump-and-load, or write a small script)?
2. Create my keyspaces on SN with RF=3, bootstrap the 3N nodes into a 4-node cluster when they're ready, then remove SN from the cluster?
3. Use SN as normal, and when the 3N hardware arrives, physically move the data folder and commit log folder onto one of the nodes in 3N and start it up as a seed?
4. Any other recommended solutions?

I'm not even sure what the impact would be of running a single node with RF=3 - would this even work?

Any ideas would be much appreciated.

Thanks!
Matthew
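For option 1 (dump-and-load), the steps can be outlined roughly as follows. This is a non-runnable command outline, not a tested procedure: nodetool and sstableloader are the standard Cassandra tools, but the keyspace, table, and host names are examples, and it needs to be repeated per table.

```
# On SN: flush memtables so the sstables on disk are complete, then snapshot
nodetool flush
nodetool snapshot -t migration mykeyspace

# Create the same schema on 3N (with RF=3), copy the snapshot directories
# to a host that can reach 3N, then stream the sstables in:
sstableloader -d new-node-1 /path/to/snapshots/mykeyspace/mytable/
```

sstableloader streams the data to whichever replicas own it, so it works regardless of how the token ranges differ between SN and 3N.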
RE: Cassandra seems to replace existing node without specifying replace_address
Sorry, I should have been clearer. In this case we've decommissioned the node and deleted the data, commitlog, and saved caches directories, so we're not hitting CASSANDRA-8801. We also hit the "A node with address <address> already exists, cancelling join" error when performing the same steps on 2.1.0, just not in 2.1.4.

Thomas

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: 27 May 2015 20:41
To: user@cassandra.apache.org
Subject: Re: Cassandra seems to replace existing node without specifying replace_address

On Wed, May 27, 2015 at 5:48 AM, Thomas Whiteway thomas.white...@metaswitch.com wrote:

I've been investigating using replace_address to replace a node that hasn't left the cluster cleanly, and after upgrading from 2.1.0 to 2.1.4 it seems that adding a new node will automatically replace an existing node with the same IP address even if replace_address isn't used. Does anyone know whether this is an expected change (as far as I can tell it doesn't seem to be)?

This is a longstanding known issue (Cassandra has had this behavior since the inception of decom), with a fix recently (May 19, 2015) merged to trunk.
https://issues.apache.org/jira/browse/CASSANDRA-8801

The basic problem is that the node does not forget its own cluster membership information, and so joins the cluster using its stored tokens. In my opinion, decommission should wipe all stored node state, but 8801 creates a workaround that addresses this, the worst case.

=Rob
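For reference, the replace_address discussed above is normally passed as a JVM system property on the replacement node; a minimal sketch (the IP is an example, and the flag should be removed once the node has finished bootstrapping):

```
# cassandra-env.sh on the replacement node
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.5"
```

The issue in this thread is that 2.1.x nodes were behaving as if this flag had been set even when it wasn't.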
Re: 10000+ CF support from Cassandra
Depending on your use case and data types (for example if you have a minimally nested JSON representation of the objects), you could go with a common map<string,string> representation, where keys are top-level object fields and values are valid JSON literals as strings; e.g. unquoted primitives, quoted strings, unquoted arrays or other objects.

Each top-level field is then independently updatable - which may be beneficial (and allows you to trivially keep historical versions of objects, if that is a requirement).

If you are updating the object in its entirety on save, then simply store the entire object in a single CQL field, and denormalize any search fields you may need (which you kinda want to do anyway).

Sent from my iPhone

On May 28, 2015, at 1:49 AM, Arun Chaitanya chaitan64a...@gmail.com wrote:

Hello Jack,

[remainder of quoted thread snipped - see the original messages earlier in this digest]
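Graham's map<string,string> idea can be sketched as follows. This is a hedged illustration (the helper names are made up), assuming each top-level field is stored as its own JSON literal in a map<text,text> column so that fields can be written independently:

```python
import json

def to_field_map(obj):
    """Encode each top-level field as an independent JSON-literal string,
    suitable for storing in a Cassandra map<text,text> column."""
    return {field: json.dumps(value) for field, value in obj.items()}

def from_field_map(field_map):
    """Decode a map of JSON literals back into the original object."""
    return {field: json.loads(literal) for field, literal in field_map.items()}

order = {"id": 42, "tags": ["a", "b"], "customer": {"name": "acme"}}
encoded = to_field_map(order)      # e.g. encoded["id"] == "42"
assert from_field_map(encoded) == order
```

Updating a single field is then a single map-entry write on the Cassandra side, rather than rewriting the whole serialized object.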
cassandra.WriteTimeout: code=1100 [Coordinator node timed out waiting for replica nodes' responses]
Hi, I'm running Cassandra 2.1.5 (single datacenter, 4 nodes, 16GB VPS each node). I have given my configuration below. I'm using the Python driver on my clients; when I tried to insert 1049067 items I got an error:

cassandra.WriteTimeout: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message=Operation timed out - received only 0 responses. info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

I also get an error when I check the count of the CF from cqlsh:

OperationTimedOut: errors={}, last_host=127.0.0.1

I've installed DataStax OpsCenter for monitoring nodes; it shows about 1000/sec write requests even when the cluster is idle. Is there any problem with my Cassandra configuration?

cluster_name: 'Test Cluster'
num_tokens: 256
hinted_handoff_enabled: true
max_hint_window_in_ms: 1080 # 3 hours
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
batchlog_replay_throttle_in_kb: 1024
authenticator: AllowAllAuthenticator
authorizer: AllowAllAuthorizer
permissions_validity_in_ms: 2000
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
disk_failure_policy: stop
commit_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
counter_cache_size_in_mb:
counter_cache_save_period: 7200
saved_caches_directory: /var/lib/cassandra/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
commitlog_segment_size_in_mb: 32
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: 10.xx.xx.xx,10.xx.xx.xxx
concurrent_reads: 45
concurrent_writes: 64
concurrent_counter_writes: 32
memtable_allocation_type: heap_buffers
memtable_flush_writers: 6
index_summary_capacity_in_mb:
index_summary_resize_interval_in_minutes: 60
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7000
ssl_storage_port: 7001
listen_address: 10.xx.x.xxx
start_native_transport: true
native_transport_port: 9042
rpc_address: 0.0.0.0
rpc_port: 9160
broadcast_rpc_address: 10.xx.x.xxx
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
incremental_backups: false
snapshot_before_compaction: false
auto_snapshot: true
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 10
column_index_size_in_kb: 64
batch_size_warn_threshold_in_kb: 5
compaction_throughput_mb_per_sec: 16
sstable_preemptive_open_interval_in_mb: 50
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 1
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 6
request_timeout_in_ms: 1
cross_node_timeout: false
endpoint_snitch: SimpleSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 60
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
server_encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra
# enable or disable client/server encryption.
client_encryption_options:
    enabled: false
    keystore: conf/.keystore
    keystore_password: cassandra
internode_compression: all
inter_dc_tcp_nodelay: false
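A common client-side mitigation for intermittent WriteTimeout (independent of the 2.1.5 question discussed in this thread) is to retry the write with exponential backoff. A generic sketch: 'execute' is any zero-argument callable that performs the write (for example, a wrapped session.execute from the Python driver), and the retry count, delays, and exception tuple are illustrative defaults, not driver behavior.

```python
import time

def execute_with_retry(execute, retries=3, base_delay=0.1,
                       retriable=(Exception,)):
    """Call 'execute', retrying on 'retriable' exceptions with exponential
    backoff: base_delay, 2*base_delay, 4*base_delay, ... then re-raise."""
    for attempt in range(retries + 1):
        try:
            return execute()
        except retriable:
            if attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

With the DataStax Python driver one would typically pass something like `lambda: session.execute(stmt, params)` as `execute` and restrict `retriable` to the driver's timeout exceptions; raising write_request_timeout_in_ms on the server is the other common lever.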
Re: cassandra.WriteTimeout: code=1100 [Coordinator node timed out waiting for replica nodes' responses]
I have experienced similar results: OperationTimedOut after inserting many millions of records on a 5-node cluster using Cassandra 2.1.5. I rolled back to 2.1.4, using identically the same configuration as with 2.1.5, and the timeouts went away. This is not the solution to your problem, but just to say that for me 2.1.5 seems to saturate when bulk-inserting many millions of records.

On 28 May 2015, at 15:15, Sachin PK sachinpray...@gmail.com wrote:

Hi, I'm running Cassandra 2.1.5 (single datacenter, 4 nodes, 16GB VPS each node). I have given my configuration below. I'm using the Python driver on my clients; when I tried to insert 1049067 items I got an error:

cassandra.WriteTimeout: code=1100 [Coordinator node timed out waiting for replica nodes' responses] message=Operation timed out - received only 0 responses. info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

[quoted configuration snipped - see the original message above]
Re: 10000+ CF support from Cassandra
While Graham's suggestion will let you collapse a bunch of tables into a single one, it'll likely result in so many other problems it won't be worth the effort. I strongly advise against this approach.

First off, different workloads need different tuning: compaction strategies, gc_grace_seconds, garbage collection, etc. This is very workload specific, and you'll quickly find that fixing one person's problem will negatively impact someone else.

Nested JSON using maps will not lead to a good data model from a performance perspective and will limit your flexibility. As CQL becomes more expressive you'll miss out on its querying potential, as well as the ability to *easily* query those tables from tools like Spark. You'll also hit the limit on the number of elements in a map, which to my knowledge still exists in current C* versions.

If you're truly dealing with a lot of data, you'll be managing one cluster that is thousands of nodes. Managing clusters > 1k nodes is territory that only a handful of people in the world are familiar with. Even the guys at Netflix stick to a couple hundred.

Managing multi-tenancy for a hundred clients, each with different version requirements, will be a nightmare from a people perspective. You'll need everyone to be in sync when you upgrade your cluster. This is just a mess; people are, in general, pretty bad at this type of thing. Coordinating a hundred application upgrades (say, to use a newer driver version) is pretty much impossible.

Off-heap in 2.1 isn't fully off heap. Read http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1 for details.

If you hit any performance issues (GC, etc.) you will take down your entire business instead of just a small portion. Everyone will be impacting everyone else. One app's tombstones will cause compaction problems for everyone using that table and be a disaster to try to fix.

A side note: you can get away with more than 8GB of memory if you use G1GC. In fact, it only really works if you use > 8GB. Using ParNew + CMS, tuning the JVM is a different story. The following 2 pages are a good read if you're interested in such details:
https://issues.apache.org/jira/browse/CASSANDRA-8150
http://blakeeggleston.com/cassandra-tuning-the-jvm-for-read-heavy-workloads.html

My recommendation: separate your concerns. Put each application (or a handful of applications) on its own cluster and maintain multiple clusters. Put each application in a different keyspace, and model normally. If you need to move an app off onto its own cluster, do so by setting up a second DC for that keyspace, replicating, then shifting over.

Jon

On Thu, May 28, 2015 at 3:06 AM Graham Sanderson gra...@vast.com wrote:

Depending on your use case and data types (for example if you have a minimally nested JSON representation of the objects), you could go with a common map<string,string> representation, where keys are top-level object fields and values are valid JSON literals as strings; e.g. unquoted primitives, quoted strings, unquoted arrays or other objects. Each top-level field is then independently updatable - which may be beneficial (and allows you to trivially keep historical versions of objects, if that is a requirement). If you are updating the object in its entirety on save, then simply store the entire object in a single CQL field, and denormalize any search fields you may need (which you kinda want to do anyway).

Sent from my iPhone

On May 28, 2015, at 1:49 AM, Arun Chaitanya chaitan64a...@gmail.com wrote:

[remainder of quoted thread snipped - see the original messages earlier in this digest]
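For reference, the G1 switch Jon mentions is typically made in cassandra-env.sh. A sketch, assuming a dedicated machine with a large heap; the values are examples, not recommendations, and the ParNew/CMS flags shipped in the default file must be removed when enabling G1:

```
# cassandra-env.sh: use G1 instead of the default ParNew + CMS
MAX_HEAP_SIZE="16G"
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
```

G1 sizes its own young generation, so HEAP_NEWSIZE tuning does not apply the way it does with CMS.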