Re: unsubscribe
Raman,

To unsubscribe, send a mail to user-unsubscr...@cassandra.apache.org

On 22 February 2015 at 03:59, Raman ra...@assetcorporation.net wrote:
> unsubscribe
Re: Running Cassandra + Spark on AWS - architecture questions
I'm not sure if this is a good use case for you, but you might also consider setting up several keyspaces: one for any data you want available for analytics (such as business object tables), and one for data you don't want to run analytics on (such as custom secondary indices). Maybe a third one for data that should only exist in the analytics space, such as temporary rollup data. This can reduce the amount of data you replicate into your analytics space, and allow you to run a smaller analytics cluster than your production cluster.

On Fri, Feb 20, 2015 at 2:43 PM, DuyHai Doan doanduy...@gmail.com wrote:

> Cassandra would take care of keeping the data synced between the two sets of five nodes. Is that correct?

Correct.

> But doing so means that we need 2x as many nodes as we need for the real-time cluster alone

Not necessarily. With multi-DC you can configure the replication factor per DC, meaning you can have RF=3 for the real-time DC and RF=1 or RF=2 for the analytics DC. Thus the number of nodes can be different for each DC.

In addition, you can also tune the hardware. If the real-time DC is mostly write-only for incoming data and read-only from aggregated tables, it is less IO-intensive than the analytics DC, which does lots of reads and writes to compute aggregations.

On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly clint.ke...@gmail.com wrote:

Hi all,

I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed-workload Cassandra + Spark installation would look like, especially on AWS.
What I gather is that you use OpsCenter to set up the following:

- One virtual data center for real-time processing (e.g., ingestion of time-series data, replying to requests for an interactive application)
- Another virtual data center for batch analytics (Spark, possibly for machine learning)

If I understand this correctly, if I estimate that I need a five-node cluster to handle all of my data, then under the system described above I would have five nodes serving real-time traffic, and all of the data replicated to another five nodes that I use for batch processing. Cassandra would take care of keeping the data synced between the two sets of five nodes. Is that correct?

I assume the motivation for such a dual-virtual-data-center architecture is to prevent the Spark jobs (which are going to do lots of scans from Cassandra, and may run computation on the same machines hosting Cassandra) from disrupting real-time performance. But doing so means that we need 2x as many nodes as we need for the real-time cluster alone.

*Could someone confirm that my interpretation above of what I read about in the DSE documentation is correct?*

If my application needs to run analytics on Spark only a few hours a day, might we be better off spending our money on a bigger Cassandra cluster and then just spinning up Spark jobs on EMR for a few hours at night? (I know this is a hard question to answer, since it all depends on the application; just curious if anyone else here has had to make similar tradeoffs.) E.g., maybe instead of a five-node real-time cluster, we would have an eight-node real-time cluster, and use our remaining budget on EMR jobs.

I am curious if anyone has any thoughts / experience about this.

Best regards,
Clint
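[Editor's note] The per-DC replication factors and keyspace separation suggested earlier in this thread can be expressed in CQL roughly as follows. This is only a sketch: the keyspace names and data center names (realtime_dc, analytics_dc) are hypothetical; real DC names must match what the snitch reports.

```sql
-- Business object tables: full replication in the real-time DC,
-- a smaller footprint in the analytics DC (RF=3 vs RF=2).
CREATE KEYSPACE business_objects
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'realtime_dc': 3,
    'analytics_dc': 2
  };

-- Custom secondary indices: omitting analytics_dc means no
-- replicas are placed there, so this data never reaches Spark nodes.
CREATE KEYSPACE custom_indices
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'realtime_dc': 3
  };

-- Temporary rollup data: exists only in the analytics DC.
CREATE KEYSPACE rollups
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'analytics_dc': 1
  };
```

With NetworkTopologyStrategy, any DC not listed in the replication map simply holds no replicas for that keyspace, which is what makes this separation work.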
Efficient .net client for cassandra
Hi All,

We have been able to get our case-specific full-text search working, which we are doing using Stratio Cassandra. It has a modified secondary index API that uses Lucene indices. The performance also seems good to me. Still, I wanted to ask you gurus:

1) Has anybody used Stratio, and are there any drawbacks to it?
2) We are using .NET as the client to extract data, and it lacks performance. I am using traditional connection pooling and then executing prepared statements. If anybody is using a specific client for .NET, your help on this would be appreciated.

Thanks in advance for the help.

Thanks and Regards,
Asit
Re: C* 2.1.3 - Incremental replacement of compacted SSTables
We had some issues with it right before we wanted to release 2.1.3, so we temporarily(?) disabled it. It *might* get removed entirely in 2.1.4. If you have any input, please comment on this ticket: https://issues.apache.org/jira/browse/CASSANDRA-8833

/Marcus

On Sat, Feb 21, 2015 at 7:29 PM, Mark Greene green...@gmail.com wrote:

I saw in NEWS.txt that this has been disabled. Does anyone know why that was the case? Is it temporary, just for the 2.1.3 release?

Thanks,
Mark Greene
Re: run cassandra on a small instance
You might also see some gains from setting in_memory_compaction_limit_in_mb to something very low, to force Cassandra to use on-disk compaction rather than doing it in memory.

On 23 February 2015 at 14:12, Tim Dunphy bluethu...@gmail.com wrote:

Nate,

Definitely thank you for this advice. After leaving the new Cassandra node running on the 2GB instance for the past couple of days, I think I've had ample reason to report complete success in getting it stabilized on that instance! Here are the changes I've been able to make. I think tuning the key cache, concurrent writes, and some of the other settings from that thread on the Cassandra list was key to getting Cassandra to work on the new instance. Check out the before and after (before = not working / after = working):

Before, in cassandra-env.sh:

MAX_HEAP_SIZE=800M
HEAP_NEWSIZE=200M

After:

MAX_HEAP_SIZE=512M
HEAP_NEWSIZE=100M

And before, in the cassandra.yaml file:

concurrent_writes: 32
compaction_throughput_mb_per_sec: 16
key_cache_size_in_mb:
key_cache_save_period: 14400
# native_transport_max_threads: 128

And after:

concurrent_writes: 2
compaction_throughput_mb_per_sec: 8
key_cache_size_in_mb: 4
key_cache_save_period: 0
native_transport_max_threads: 4

That really made the difference. I'm a Puppet user, so these changes are in Puppet; any new 2GB instances I bring up on Digital Ocean should absolutely work the way the first 2GB node does there. I was able to make enough sense of your Chef recipe to adapt what you were showing me. Thanks again!
Tim

On Fri, Feb 20, 2015 at 10:31 PM, Tim Dunphy bluethu...@gmail.com wrote:

> The most important things to note:
> - don't include JNA (it needs to lock pages larger than what will be available)
> - turn down threadpools for transports
> - turn compaction throughput way down
> - make concurrent reads and writes very small
>
> I have used the above to run a healthy 5-node cluster locally in its own private network, with a 6th monitoring server, for light to moderate local testing in 16GB of laptop RAM. YMMV, but it is possible.

Thanks!! That was very helpful. I just tried applying your suggestions to my cassandra.yaml file. I used the info from your Chef recipe. Well, like I've been saying, it typically takes about 5 hours or so for this situation to shake itself out. I'll provide an update to the list once I have a better idea of how this is working. Thanks again!

Tim

On Fri, Feb 20, 2015 at 9:37 PM, Nate McCall n...@thelastpickle.com wrote:

I frequently test with multi-node Vagrant-based clusters locally.
The following Chef attributes should give you an idea of what to turn down in cassandra.yaml and cassandra-env.sh to build a decent testing cluster:

:cassandra => {
  'cluster_name' => 'VerifyCluster',
  'package_name' => 'dsc20',
  'version' => '2.0.11',
  'release' => '1',
  'setup_jna' => false,
  'max_heap_size' => '512M',
  'heap_new_size' => '100M',
  'initial_token' => server['initial_token'],
  'seeds' => '192.168.33.10',
  'listen_address' => server['ip'],
  'broadcast_address' => server['ip'],
  'rpc_address' => server['ip'],
  'concurrent_reads' => 2,
  'concurrent_writes' => 2,
  'memtable_flush_queue_size' => 2,
  'compaction_throughput_mb_per_sec' => 8,
  'key_cache_size_in_mb' => 4,
  'key_cache_save_period' => 0,
  'native_transport_min_threads' => 2,
  'native_transport_max_threads' => 4,
  'notify_restart' => true,
  'reporter' => {
    'riemann' => { 'enable' => true, 'host' => '192.168.33.51' },
    'graphite' => { 'enable' => true, 'host' => '192.168.33.51' }
  }
}

The most important things to note:
- don't include JNA (it needs to lock pages larger than what will be available)
- turn down threadpools for transports
- turn compaction throughput way down
- make concurrent reads and writes very small

I have used the above to run a healthy 5-node cluster locally in its own private network, with a 6th monitoring server, for light to moderate local testing in 16GB of laptop RAM. YMMV, but it is possible.

--
GPG me!! gpg --keyserver pool.sks-keyservers.net --recv-keys F186197B
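[Editor's note] Rendered into the node's configuration, the tuning attributes above correspond roughly to the following cassandra.yaml fragment. This is a sketch for orientation only; key names follow the cassandra.yaml shipped with Cassandra 2.0, and the heap settings (max_heap_size, heap_new_size) go into cassandra-env.sh rather than this file.

```yaml
# cassandra.yaml fragment for a small (2GB-class) test node,
# derived from the Chef attributes quoted above.
concurrent_reads: 2
concurrent_writes: 2
memtable_flush_queue_size: 2
compaction_throughput_mb_per_sec: 8
key_cache_size_in_mb: 4
key_cache_save_period: 0        # never persist the key cache
native_transport_min_threads: 2
native_transport_max_threads: 4
```

The corresponding cassandra-env.sh values would be MAX_HEAP_SIZE="512M" and HEAP_NEWSIZE="100M".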