Custom data types and dynamic tables
Hello,

If I have a custom type EventDefinition and I create a table like:

    create table TestTable (
        user_id long,
        ts timestamp,
        definition 'com.anishek.EventDefinition',
        primary key (user_id, ts)
    ) with clustering order by (ts desc)
      and compression = {'sstable_compression': 'SnappyCompressor'}
      and compaction = {'class': 'DateTieredCompactionStrategy',
                        'base_time_seconds': '3600',
                        'max_sstable_age_days': '30'};

then how would the data be stored internally? Based on the "Dynamic Column Family" section of http://www.datastax.com/dev/blog/thrift-to-cql3, and given that EventDefinition is stored as the string representation "code,eventName", would the data be as below?

    Row : Columns
    1   : 2015-03-02 14:33:14+=12,a  2015-03-02 14:34:14+=11,b  2015-03-02 14:35:14+=15,e  2015-03-02 14:36:14+=17,c  2015-03-02 14:37:14+=1,d
    2   : 2015-03-02 14:33:14+=12,a  2015-03-02 14:34:14+=11,b  2015-03-02 14:35:14+=15,e  2015-03-02 14:36:14+=17,c  2015-03-02 14:37:14+=1,d

Is the above correct? We will be getting events like the above per day per user, and new users keep getting added to the system. We presently assume we might have at most about 30 events for a given user. We will add data like:

    insert into TestTable (user_id, ts, definition)
    values (a, 2015-03-02 12:30:56, 1,s) using ttl [30 days];

I am assuming that date-tiered compaction will not be very effective if the timestamps are not in the same timezone across entries.

thanks
Anishek
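One detail worth checking in the INSERT above: CQL's USING TTL clause takes a plain integer number of seconds, so "[30 days]" has to be converted first. A minimal sketch of the conversion (plain arithmetic, no Cassandra required):

```python
# CQL "USING TTL n" takes n in seconds; 30 days converts to:
seconds_per_day = 24 * 60 * 60
ttl_30_days = 30 * seconds_per_day
print(ttl_30_days)  # 2592000
```

so the statement would end with "using ttl 2592000".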
Re: Disastrous profusion of SSTables
Are you frequently updating the same rows? What is the memtable flush size? Can you post the table's CREATE statement here, please?

On Thu, Mar 26, 2015 at 1:21 PM, Dave Galbraith david92galbra...@gmail.com wrote:

Hey! So I'm running Cassandra 2.1.2 and using the SizeTieredCompactionStrategy. I'm doing about 3k writes/sec on a single node. My read performance is terrible; all my queries just time out. So I do nodetool cfstats:

    Read Count: 42071
    Read Latency: 67.47804242827601 ms.
    Write Count: 131964300
    Write Latency: 0.011721604274792501 ms.
    Pending Flushes: 0
    Table: metrics16513
    SSTable count: 641
    Space used (live): 6366740812
    Space used (total): 6366740812
    Space used by snapshots (total): 0
    SSTable Compression Ratio: 0.25272488401992765
    Memtable cell count: 0
    Memtable data size: 0
    Memtable switch count: 1016
    Local read count: 42071
    Local read latency: 67.479 ms
    Local write count: 131964300
    Local write latency: 0.012 ms
    Pending flushes: 0
    Bloom filter false positives: 994
    Bloom filter false ratio: 0.00000
    Bloom filter space used: 37840376
    Compacted partition minimum bytes: 104
    Compacted partition maximum bytes: 24601
    Compacted partition mean bytes: 255
    Average live cells per slice (last five minutes): 111.67243951154147
    Maximum live cells per slice (last five minutes): 1588.0
    Average tombstones per slice (last five minutes): 0.0
    Maximum tombstones per slice (last five minutes): 0.0

and nodetool cfhistograms:

    Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                          (micros)       (micros)      (bytes)
    50%            46.00           6.99     154844.95             149           1
    75%           430.00           8.53    3518837.53             179           1
    95%           430.00          11.32    7252897.25             215           2
    98%           430.00          15.54   22103886.34             215           3
    99%           430.00          29.86   22290608.19            1597          50
    Min             0.00           1.66         26.91             104           0
    Max           430.00      269795.38   27311364.89           24601         924

Gross!! There are 641 SSTables in there, and all my reads are hitting hundreds of them and timing out. How could this possibly have happened, and what can I do about it? Nodetool compactionstats says "pending tasks: 0", by the way. Thanks!
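One 2.1.x-specific thing worth ruling out (a hedged suggestion, not a confirmed diagnosis): SizeTieredCompactionStrategy in the 2.1 line could skip SSTables it considered "cold" via the cold_reads_to_omit compaction option, which on write-heavy, rarely-read tables could leave many small SSTables uncompacted even with "pending tasks: 0". Disabling it forces those SSTables to be considered:

```sql
-- Table name taken from the cfstats output above.
ALTER TABLE metrics16513
  WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                     'cold_reads_to_omit': '0.0'};
```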
Re: Replication to second data center with different number of nodes
Colin,

When you said a larger number of tokens has a query performance hit, is it read or write performance? Also, if you have any links you could share to shed some light on this, that would be great.

Thanks
Anishek

On Sun, Mar 29, 2015 at 2:20 AM, Colin Clark co...@clark.ws wrote:

I typically use a number a lot lower than 256, usually less than 20, for num_tokens, as a larger number has historically had a dramatic impact on query performance.

— Colin Clark co...@clark.ws +1 612-859-6129 skype colin.p.clark

On Mar 28, 2015, at 3:46 PM, Eric Stevens migh...@gmail.com wrote:

If you're curious about how Cassandra knows how to replicate data in the remote DC: it's the same as in the local DC. Replication is independent in each, and you can even set a different replication strategy per keyspace per datacenter. Nodes in each DC take up num_tokens positions on a ring, each partition key is mapped to a position on that ring, and whoever owns that part of the ring is the primary for that data. Then (oversimplified) the r-1 adjacent nodes become replicas for that same data.

On Fri, Mar 27, 2015 at 6:55 AM, Sibbald, Charles charles.sibb...@bskyb.com wrote:

http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__num_tokens

So go with the default of 256, and leave initial_token empty:

    num_tokens: 256
    # initial_token:

Cassandra will always give each node the same number of tokens; the only time you might want to distribute this differently is if your instances are of different sizing/capability, which is also a bad scenario.

From: Björn Hachmann bjoern.hachm...@metrigo.de
Reply-To: user@cassandra.apache.org
Date: Friday, 27 March 2015 12:11
To: user@cassandra.apache.org
Subject: Re: Replication to second data center with different number of nodes

2015-03-27 11:58 GMT+01:00 Sibbald, Charles charles.sibb...@bskyb.com: Cassandra's Vnodes config

Thank you. Yes, we are using vnodes!
The num_tokens parameter controls the number of vnodes assigned to a specific node. Maybe I am seeing problems where there are none. Let me rephrase my question: how does Cassandra know it has to replicate 1/3 of all keys to each single node in the second DC? I can see two ways:

1. It has to be configured explicitly.
2. It is derived from the number of nodes available in the data center at the time `nodetool rebuild` is started.

Kind regards
Björn
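Following Eric's explanation upthread, it is effectively option 1: the per-datacenter replication factor is configured explicitly on the keyspace and is independent of how many nodes each DC happens to have. A sketch (keyspace and DC names here are placeholders):

```sql
CREATE KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'DC1': '3', 'DC2': '2'};
```

Which node stores which keys then falls out of token ownership on each DC's ring, not out of the node count at `nodetool rebuild` time.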
Re: write timeout
Forgot to mention: I am using Cassandra 2.0.13.

On Mon, Mar 23, 2015 at 5:59 PM, Anishek Agarwal anis...@gmail.com wrote:

Hello,

I am using a single-node, server-class machine with 16 CPUs and 32GB RAM, with a single drive attached to it. My table structure is:

    CREATE TABLE t1 (
        id bigint,
        ts timestamp,
        cat1 set<text>,
        cat2 set<text>,
        lat float,
        lon float,
        a bigint,
        primary key (id, ts)
    );

I am trying to insert 300 entries per partition key for 4000 partition keys, using 25 threads.

Configuration:

    write_request_timeout_in_ms: 5000
    concurrent_writes: 32
    heap space: 8GB
    client-side timeout: 12 sec (DataStax Java driver)
    consistency level: ONE

With the above configuration I run this 10 times, to eventually generate around 300 * 4000 * 10 = 12,000,000 entries. After the first few runs I get a WriteTimeoutException at the client, with the message "1 replica were required but only 0 acknowledged the write". There are no errors in the server log.

Why does this error occur, and how do I find out what limit I should put on concurrent writes to a single node? Looking at iostat, disk utilization seems to be at 1-3% when running this.

Please let me know if anything else is required.

Regards,
Anishek
write timeout
Hello,

I am using a single-node, server-class machine with 16 CPUs and 32GB RAM, with a single drive attached to it. My table structure is:

    CREATE TABLE t1 (
        id bigint,
        ts timestamp,
        cat1 set<text>,
        cat2 set<text>,
        lat float,
        lon float,
        a bigint,
        primary key (id, ts)
    );

I am trying to insert 300 entries per partition key for 4000 partition keys, using 25 threads.

Configuration:

    write_request_timeout_in_ms: 5000
    concurrent_writes: 32
    heap space: 8GB
    client-side timeout: 12 sec (DataStax Java driver)
    consistency level: ONE

With the above configuration I run this 10 times, to eventually generate around 300 * 4000 * 10 = 12,000,000 entries. After the first few runs I get a WriteTimeoutException at the client, with the message "1 replica were required but only 0 acknowledged the write". There are no errors in the server log.

Why does this error occur, and how do I find out what limit I should put on concurrent writes to a single node? Looking at iostat, disk utilization seems to be at 1-3% when running this.

Please let me know if anything else is required.

Regards,
Anishek
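For scale, the write volume described above works out as follows (plain arithmetic):

```python
entries_per_partition = 300
partitions = 4000
runs = 10
writes_per_run = entries_per_partition * partitions   # 1,200,000 inserts per run
total_writes = writes_per_run * runs                  # 12,000,000 overall
print(writes_per_run, total_writes)
```

That is 1.2 million inserts per run from 25 threads against one node, so even with data-disk utilization at 1-3% the bottleneck can plausibly be CPU, the commit log, or memtable flushing rather than the data disk.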
Re: LCS Strategy, compaction pending tasks keep increasing
Sorry, I take that back: we will modify different keys across threads, not the same key; our Storm topology is going to use field grouping to route updates for the same keys to the same set of bolts.

On Tue, Apr 21, 2015 at 6:17 PM, Anishek Agarwal anis...@gmail.com wrote:

@Brice: I don't think so, as I am giving each thread a specific key range with no overlaps, so that does not seem to be the case now. However, we will have to test the case where we modify the same key across threads. Do you think that will cause a problem? As far as I have read, LCS is recommended for such cases. Should I just switch back to SizeTieredCompactionStrategy?

On Tue, Apr 21, 2015 at 6:13 PM, Brice Dutheil brice.duth...@gmail.com wrote:

Could it be that the app is inserting _duplicate_ keys?

-- Brice

On Tue, Apr 21, 2015 at 1:52 PM, Marcus Eriksson krum...@gmail.com wrote:

Nope, but you can correlate, I guess: tools/bin/sstablemetadata gives you per-sstable level information. Also, since you get so many L0 sstables, it is likely that you will be doing size-tiered compaction in L0 for a while.

On Tue, Apr 21, 2015 at 1:40 PM, Anishek Agarwal anis...@gmail.com wrote:

@Marcus I did look, and that is where I got the above, but it doesn't show any detail about moving from L0 -> L1. Any specific arguments I should try with?

On Tue, Apr 21, 2015 at 4:52 PM, Marcus Eriksson krum...@gmail.com wrote:

You need to look at nodetool compactionstats; there is probably a big L0 -> L1 compaction going on that blocks other compactions from starting.

On Tue, Apr 21, 2015 at 1:06 PM, Anishek Agarwal anis...@gmail.com wrote:

The some_bits column has about 14-15 bytes of data per key.

On Tue, Apr 21, 2015 at 4:34 PM, Anishek Agarwal anis...@gmail.com wrote:

Hello, I am inserting about 100 million entries via the datastax-java driver into a cassandra cluster of 3 nodes.
The table structure is:

    create keyspace test with replication =
        {'class': 'NetworkTopologyStrategy', 'DC': 3};

    CREATE TABLE test_bits (
        id bigint primary key,
        some_bits text
    ) with gc_grace_seconds = 0
      and compaction = {'class': 'LeveledCompactionStrategy'}
      and compression = {'sstable_compression': ''};

I have 75 threads inserting data into the above table, each thread having non-overlapping keys. I see that the number of pending tasks via nodetool compactionstats keeps increasing, and nodetool cfstats shows test.test_bits has SSTable levels as [154/4, 8, 0, 0, 0, 0, 0, 0, 0]. Why is compaction not kicking in?

thanks
anishek
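For reference, the [154/4, 8, 0, ...] output means 154 SSTables are sitting in L0 (where LCS only wants a handful) with 8 in L1. A rough sketch of the level sizing that explains why the backlog shows up there, assuming the 2.0-era default sstable_size_in_mb of 160 and LCS's 10x fanout per level (both assumptions; check your table's compaction options):

```python
# Each LCS level Ln (n >= 1) targets roughly 10**n sstables of
# sstable_size_in_mb each, i.e. 10x the capacity of the level below.
sstable_mb = 160  # assumed default; configurable via compaction options
level_capacity_mb = {n: (10 ** n) * sstable_mb for n in range(1, 4)}
# L1 ~ 1600 MB, L2 ~ 16000 MB, L3 ~ 160000 MB.
# New flushes always land in L0; if 75 writer threads outpace the
# L0 -> L1 compactions, L0 grows (154 sstables here) and the pending
# task count keeps climbing.
print(level_capacity_mb)
```

For sustained bulk ingestion this backlog is expected behaviour rather than a bug, which is why switching back to SizeTieredCompactionStrategy, as discussed above, is often the better fit for write-mostly loads.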
Re: LCS Strategy, compaction pending tasks keep increasing
@Brice: I don't think so, as I am giving each thread a specific key range with no overlaps, so that does not seem to be the case now. However, we will have to test the case where we modify the same key across threads. Do you think that will cause a problem? As far as I have read, LCS is recommended for such cases. Should I just switch back to SizeTieredCompactionStrategy?

On Tue, Apr 21, 2015 at 6:13 PM, Brice Dutheil brice.duth...@gmail.com wrote:

Could it be that the app is inserting _duplicate_ keys?

-- Brice

On Tue, Apr 21, 2015 at 1:52 PM, Marcus Eriksson krum...@gmail.com wrote:

Nope, but you can correlate, I guess: tools/bin/sstablemetadata gives you per-sstable level information. Also, since you get so many L0 sstables, it is likely that you will be doing size-tiered compaction in L0 for a while.

On Tue, Apr 21, 2015 at 1:40 PM, Anishek Agarwal anis...@gmail.com wrote:

@Marcus I did look, and that is where I got the above, but it doesn't show any detail about moving from L0 -> L1. Any specific arguments I should try with?

On Tue, Apr 21, 2015 at 4:52 PM, Marcus Eriksson krum...@gmail.com wrote:

You need to look at nodetool compactionstats; there is probably a big L0 -> L1 compaction going on that blocks other compactions from starting.

On Tue, Apr 21, 2015 at 1:06 PM, Anishek Agarwal anis...@gmail.com wrote:

The some_bits column has about 14-15 bytes of data per key.

On Tue, Apr 21, 2015 at 4:34 PM, Anishek Agarwal anis...@gmail.com wrote:

Hello, I am inserting about 100 million entries via the datastax-java driver into a cassandra cluster of 3 nodes. The table structure is:

    create keyspace test with replication =
        {'class': 'NetworkTopologyStrategy', 'DC': 3};

    CREATE TABLE test_bits (
        id bigint primary key,
        some_bits text
    ) with gc_grace_seconds = 0
      and compaction = {'class': 'LeveledCompactionStrategy'}
      and compression = {'sstable_compression': ''};

I have 75 threads inserting data into the above table, each thread having non-overlapping keys.
I see that the number of pending tasks via nodetool compactionstats keeps increasing, and nodetool cfstats shows test.test_bits has SSTable levels as [154/4, 8, 0, 0, 0, 0, 0, 0, 0]. Why is compaction not kicking in?

thanks
anishek
Re: LCS Strategy, compaction pending tasks keep increasing
We have 2 CPUs with 8 hyper-threaded cores per node.

On a related topic: I'm a bit concerned by DataStax communication. Usually people talk about IO as being the weak spot, but in our case it's more about CPU. Fortunately Moore's law doesn't really apply vertically anymore; now we have multi-core processors *and* the trend is going that way. Yet the DataStax terms feel a bit *antiquated*, and maybe a bit too Oracle-y: http://www.datastax.com/enterprise-terms. Node licensing is more appropriate for this century.

-- Brice

On Tue, Apr 21, 2015 at 11:19 PM, Sebastian Estevez sebastian.este...@datastax.com wrote:

Do not enable multithreaded compaction. The overhead usually outweighs any benefit. It was removed in 2.1 because it harms more than it helps: https://issues.apache.org/jira/browse/CASSANDRA-6142

All the best,
Sebastián Estévez
Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com

On Tue, Apr 21, 2015 at 9:06 AM, Brice Dutheil brice.duth...@gmail.com wrote:

I'm not sure I get everything about the Storm stuff, but my understanding of LCS is that the compaction count may increase the more one updates data (that's why I was wondering about duplicate primary keys).
Another option is that the code is sending too many write requests/s to the cassandra cluster. I don't know how many nodes you have, but the fewer nodes there are, the more compactions there will be. Also, I'd look at the CPU / load; maybe the config is too *restrictive*. Look at the following properties in cassandra.yaml:

- compaction_throughput_mb_per_sec: by default the value is 16. You may want to increase it, but be careful on mechanical drives; if you are already on SSDs, IO is rarely the issue. We have 64 (with SSDs).
- multithreaded_compaction: by default it is false; we enabled it. Compaction threads are niced, so it shouldn't be much of an issue for serving production r/w requests. But you never know; always keep an eye on IO and CPU.

— Brice

On Tue, Apr 21, 2015 at 2:48 PM, Anishek Agarwal anis...@gmail.com wrote:

Sorry, I take that back: we will modify different keys across threads, not the same key; our Storm topology is going to use field grouping to route updates for the same keys to the same set of bolts.

On Tue, Apr 21, 2015 at 6:17 PM, Anishek Agarwal anis...@gmail.com wrote:

@Brice: I don't think so, as I am giving each thread a specific key range with no overlaps, so that does not seem to be the case now. However, we will have to test the case where we modify the same key across threads. Do you think that will cause a problem? As far as I have read, LCS is recommended for such cases. Should I just switch back to SizeTieredCompactionStrategy?

On Tue, Apr 21, 2015 at 6:13 PM, Brice Dutheil brice.duth...@gmail.com wrote:

Could it be that the app is inserting _duplicate_ keys?

-- Brice

On Tue, Apr 21, 2015 at 1:52 PM, Marcus Eriksson krum...@gmail.com wrote:

Nope, but you can correlate, I guess: tools/bin/sstablemetadata gives you per-sstable level information. Also, since you get so many L0 sstables, it is likely that you will be doing size-tiered compaction in L0 for a while.
On Tue, Apr 21, 2015 at 1:40 PM, Anishek Agarwal anis...@gmail.com wrote:

@Marcus I did look, and that is where I got the above, but it doesn't show any detail about moving from L0 -> L1. Any specific arguments I should try with?

On Tue, Apr 21, 2015 at 4:52 PM, Marcus Eriksson krum...@gmail.com wrote:

You need to look at nodetool compactionstats; there is probably a big L0 -> L1 compaction going on that blocks other compactions from starting.

On Tue, Apr 21, 2015 at 1:06 PM, Anishek Agarwal anis...@gmail.com wrote:

The some_bits column has about 14-15 bytes of data per key.

On Tue, Apr 21, 2015 at 4:34 PM, Anishek Agarwal anis...@gmail.com wrote:

Hello, I am inserting about 100 million entries via the datastax-java driver into a cassandra cluster of 3 nodes. The table structure is:

    create keyspace test with replication =
        {'class': 'NetworkTopologyStrategy', 'DC': 3};

    CREATE TABLE test_bits (
        id bigint primary key,
        some_bits text
    ) with gc_grace_seconds = 0
      and compaction = {'class': 'LeveledCompactionStrategy'}
      and compression = {'sstable_compression': ''};
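The two knobs from Brice's list live in cassandra.yaml; a sketch with the values from his message (the 64 MB/s assumes SSDs, and note Sebastian's warning elsewhere in this thread that multithreaded compaction was removed in 2.1):

```yaml
# cassandra.yaml excerpt; values from the discussion above, not defaults
compaction_throughput_mb_per_sec: 64   # default is 16; keep lower on spinning disks
multithreaded_compaction: true         # pre-2.1 only; removed by CASSANDRA-6142
```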
Re: Reading hundreds of thousands of rows at once?
I think these will help speed things up:

- removing compression
- you have a lot of independent columns mentioned. If you are always going to query all of them together, one other thing that will help is to have a full JSON (or some custom object representation) of the value data, and change the model to just have survey_id, hour_created, respondent_id, *json_value*.

On Wed, Apr 22, 2015 at 1:09 PM, John Anderson son...@gmail.com wrote:

Hey, I'm looking at querying around 500,000 rows that I need to pull into a Pandas data frame for processing. Currently, testing this on a single cassandra node, it takes around 21 seconds: https://gist.github.com/sontek/4ca95f5c5aa539663eaf

I tried introducing multiprocessing so I could use 4 processes at a time to query this, and I got it down to 14 seconds: https://gist.github.com/sontek/542f13307ef9679c0094

Although shaving off 7 seconds is great, it still isn't really where I would like to be in regards to performance; for this many rows I'd really like to get down to a max of 1-2 seconds of query time. What types of optimizations can I make to improve read performance when querying a large set of data? Will this timing speed up linearly as I add more nodes?

This is what the schema looks like currently: https://gist.github.com/sontek/d6fa3fc1b6d085ad3fa4

I'm not tied to the current schema at all; it's mostly just a replication of what we have in SQL Server. I'm more interested in what things I can change to make querying it faster.

Thanks,
John
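A sketch of the denormalized model suggested above; the column names come from the mail, while the types and partition-key layout are assumptions about the survey schema:

```sql
CREATE TABLE survey_results_flat (
    survey_id     bigint,
    hour_created  timestamp,
    respondent_id uuid,
    json_value    text,        -- one JSON blob holding all the answer columns
    PRIMARY KEY ((survey_id, hour_created), respondent_id)
);
```

Reading one wide blob per row avoids per-column overhead on the read path, at the cost of doing the JSON parsing client-side.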
Re: Reading hundreds of thousands of rows at once?
You might also want to go through a thread here on the list with the subject "High latencies for simple queries".

On Wed, Apr 22, 2015 at 1:55 PM, Anishek Agarwal anis...@gmail.com wrote:

I think these will help speed things up:

- removing compression
- you have a lot of independent columns mentioned. If you are always going to query all of them together, one other thing that will help is to have a full JSON (or some custom object representation) of the value data, and change the model to just have survey_id, hour_created, respondent_id, *json_value*.

On Wed, Apr 22, 2015 at 1:09 PM, John Anderson son...@gmail.com wrote:

Hey, I'm looking at querying around 500,000 rows that I need to pull into a Pandas data frame for processing. Currently, testing this on a single cassandra node, it takes around 21 seconds: https://gist.github.com/sontek/4ca95f5c5aa539663eaf

I tried introducing multiprocessing so I could use 4 processes at a time to query this, and I got it down to 14 seconds: https://gist.github.com/sontek/542f13307ef9679c0094

Although shaving off 7 seconds is great, it still isn't really where I would like to be in regards to performance; for this many rows I'd really like to get down to a max of 1-2 seconds of query time. What types of optimizations can I make to improve read performance when querying a large set of data? Will this timing speed up linearly as I add more nodes?

This is what the schema looks like currently: https://gist.github.com/sontek/d6fa3fc1b6d085ad3fa4

I'm not tied to the current schema at all; it's mostly just a replication of what we have in SQL Server. I'm more interested in what things I can change to make querying it faster.

Thanks,
John
Re: LCS Strategy, compaction pending tasks keep increasing
@Marcus I did look, and that is where I got the above, but it doesn't show any detail about moving from L0 -> L1. Any specific arguments I should try with?

On Tue, Apr 21, 2015 at 4:52 PM, Marcus Eriksson krum...@gmail.com wrote:

You need to look at nodetool compactionstats; there is probably a big L0 -> L1 compaction going on that blocks other compactions from starting.

On Tue, Apr 21, 2015 at 1:06 PM, Anishek Agarwal anis...@gmail.com wrote:

The some_bits column has about 14-15 bytes of data per key.

On Tue, Apr 21, 2015 at 4:34 PM, Anishek Agarwal anis...@gmail.com wrote:

Hello, I am inserting about 100 million entries via the datastax-java driver into a cassandra cluster of 3 nodes. The table structure is:

    create keyspace test with replication =
        {'class': 'NetworkTopologyStrategy', 'DC': 3};

    CREATE TABLE test_bits (
        id bigint primary key,
        some_bits text
    ) with gc_grace_seconds = 0
      and compaction = {'class': 'LeveledCompactionStrategy'}
      and compression = {'sstable_compression': ''};

I have 75 threads inserting data into the above table, each thread having non-overlapping keys. I see that the number of pending tasks via nodetool compactionstats keeps increasing, and nodetool cfstats shows test.test_bits has SSTable levels as [154/4, 8, 0, 0, 0, 0, 0, 0, 0]. Why is compaction not kicking in?

thanks
anishek
Network transfer to one node twice as others
Hello,

We are using cassandra 2.0.14 and have a cluster of 3 nodes. I have a writer test (written in Java) that runs 50 threads to populate data into a single table in a single keyspace. When I look at iftop, I see that the amount of network transfer happening on two of the nodes is the same, but on one of the nodes it is almost twice that of the other two. Any reason that would be the case?

Thanks
Anishek
Re: LCS Strategy, compaction pending tasks keep increasing
The some_bits column has about 14-15 bytes of data per key.

On Tue, Apr 21, 2015 at 4:34 PM, Anishek Agarwal anis...@gmail.com wrote:

Hello, I am inserting about 100 million entries via the datastax-java driver into a cassandra cluster of 3 nodes. The table structure is:

    create keyspace test with replication =
        {'class': 'NetworkTopologyStrategy', 'DC': 3};

    CREATE TABLE test_bits (
        id bigint primary key,
        some_bits text
    ) with gc_grace_seconds = 0
      and compaction = {'class': 'LeveledCompactionStrategy'}
      and compression = {'sstable_compression': ''};

I have 75 threads inserting data into the above table, each thread having non-overlapping keys. I see that the number of pending tasks via nodetool compactionstats keeps increasing, and nodetool cfstats shows test.test_bits has SSTable levels as [154/4, 8, 0, 0, 0, 0, 0, 0, 0]. Why is compaction not kicking in?

thanks
anishek
LCS Strategy, compaction pending tasks keep increasing
Hello,

I am inserting about 100 million entries via the datastax-java driver into a cassandra cluster of 3 nodes. The table structure is:

    create keyspace test with replication =
        {'class': 'NetworkTopologyStrategy', 'DC': 3};

    CREATE TABLE test_bits (
        id bigint primary key,
        some_bits text
    ) with gc_grace_seconds = 0
      and compaction = {'class': 'LeveledCompactionStrategy'}
      and compression = {'sstable_compression': ''};

I have 75 threads inserting data into the above table, each thread having non-overlapping keys. I see that the number of pending tasks via nodetool compactionstats keeps increasing, and nodetool cfstats shows test.test_bits has SSTable levels as [154/4, 8, 0, 0, 0, 0, 0, 0, 0]. Why is compaction not kicking in?

thanks
anishek
Re: LCS Strategy, compaction pending tasks keep increasing
I am on version 2.0.14; I will update once I get the stats up for the writes again.

On Tue, Apr 21, 2015 at 4:46 PM, Carlos Rolo r...@pythian.com wrote:

Are you on version 2.1.x?

Regards,

Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com

On Tue, Apr 21, 2015 at 1:06 PM, Anishek Agarwal anis...@gmail.com wrote:

The some_bits column has about 14-15 bytes of data per key.

On Tue, Apr 21, 2015 at 4:34 PM, Anishek Agarwal anis...@gmail.com wrote:

Hello, I am inserting about 100 million entries via the datastax-java driver into a cassandra cluster of 3 nodes. The table structure is:

    create keyspace test with replication =
        {'class': 'NetworkTopologyStrategy', 'DC': 3};

    CREATE TABLE test_bits (
        id bigint primary key,
        some_bits text
    ) with gc_grace_seconds = 0
      and compaction = {'class': 'LeveledCompactionStrategy'}
      and compression = {'sstable_compression': ''};

I have 75 threads inserting data into the above table, each thread having non-overlapping keys. I see that the number of pending tasks via nodetool compactionstats keeps increasing, and nodetool cfstats shows test.test_bits has SSTable levels as [154/4, 8, 0, 0, 0, 0, 0, 0, 0]. Why is compaction not kicking in?

thanks
anishek
Re: Unable to connect via cqlsh or datastax-driver
Did you set the CQLSH_HOST environment variable to that IP so cqlsh uses it?

On Tue, May 5, 2015 at 8:50 PM, Björn Hachmann bjoern.hachm...@metrigo.de wrote:

Hello,

I am unable to connect to the nodes of our second datacenter, not even from localhost. The error message I receive is:

    Connection error: ('Unable to connect to any servers', {'...': OperationTimedOut('errors=None, last_host=None',)})

I already checked some things:

- The node starts to listen for CQL clients on the expected port (extract from the log): "Starting listening for CQL clients on .../192.168.1.23:9042"
- The port is open and accepts connections via telnet.
- nodetool info works and returns:
    Gossip active          : true
    Thrift active          : true
    Native Transport active: true
- nodetool netstats: Mode: NORMAL
- nodetool statusbinary: running

Any help would be highly appreciated! Thank you very much.

Kind regards
Björn
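A sketch of the environment-variable approach; the address comes from the log line quoted above, and the cqlsh invocation itself is left commented out:

```shell
# Point cqlsh at the node's native-transport address instead of localhost.
export CQLSH_HOST=192.168.1.23
export CQLSH_PORT=9042
echo "cqlsh will connect to ${CQLSH_HOST}:${CQLSH_PORT}"
# cqlsh                    # or equivalently: cqlsh 192.168.1.23 9042
```

If it still times out, also check that rpc_address / broadcast_rpc_address on the node resolves to an address reachable from the client, and that nothing between client and node drops traffic on 9042 after the TCP handshake.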
Re: Read performance
How many SSTables were there? What compaction strategy are you using? These properties define how many disk reads cassandra may have to do to get all the data you need, depending on which SSTables hold data for your partition key.

On Fri, May 8, 2015 at 6:25 PM, Alprema alpr...@alprema.com wrote:

I was planning on using a more server-friendly strategy anyway (by parallelizing my workload over multiple metrics), but my concern here is more about the raw numbers. According to the trace and my estimate of the data size, the read from disk was done at about 30 MByte/s, and the transfer between the responsible node and the coordinator was done at 120 Mbit/s, which doesn't seem right given that the cluster was not busy and the network is Gbit-capable. I know that there is some overhead, but these numbers seem odd to me; do they seem normal to you?

On Fri, May 8, 2015 at 2:34 PM, Bryan Holladay holla...@longsight.com wrote:

Try breaking it up into smaller chunks using multiple threads and token ranges. 86400 is pretty large; I found ~1000 results per query is good. This will spread the burden across all servers a little more evenly.

On Thu, May 7, 2015 at 4:27 AM, Alprema alpr...@alprema.com wrote:

Hi,

I am writing an application that will periodically read big amounts of data from Cassandra, and I am experiencing odd performance. My column family is a classic time-series one, with series ID and day as the partition key and a timestamp as the clustering key, the value being a double. The query I run gets all the values for a given time series for a given day (so about 86400 points):

    SELECT UtcDate, Value
    FROM Metric_OneSec
    WHERE MetricId = 12215ece-6544-4fcf-a15d-4f9e9ce1567e
      AND Day = '2015-05-05 00:00:00+'
    LIMIT 86400;

This takes about 450 ms to run, and when I trace the query I see that it takes about 110 ms to read the data from disk and 224 ms to send the data from the responsible node to the coordinator (full trace in attachment).
I did a quick estimate of the requested data size (correct me if I'm wrong):

    86400 * (column name + column value + timestamp + ttl)
    = 86400 * (8 + 8 + 8 + 8?) bytes
    ≈ 2.6 MB

Let's say about 3 MB with misc. overhead, so these timings seem pretty slow to me for a modern SSD and a 1 Gb/s NIC. Do those timings seem normal? Am I missing something?

Thank you,
Kévin
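Redoing the arithmetic from the two messages above (the 8-bytes-per-field sizes are the mail's own assumption):

```python
points = 86400
bytes_per_point = 8 + 8 + 8 + 8                   # name + value + timestamp + ttl
payload_bytes = points * bytes_per_point          # 2,764,800 bytes, ~2.6 MiB
disk_mb_per_s = payload_bytes / 1e6 / 0.110       # ~25 MB/s off disk
net_mbit_per_s = payload_bytes * 8 / 1e6 / 0.224  # ~99 Mbit/s node -> coordinator
print(payload_bytes, round(disk_mb_per_s, 1), round(net_mbit_per_s, 1))
```

That is roughly consistent with the 30 MByte/s and 120 Mbit/s figures quoted; the gap to raw SSD and GbE speeds is plausibly per-cell deserialization and protocol overhead rather than the hardware itself.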
Re: error='Cannot allocate memory' (errno=12)
The amount of memory cassandra is trying to allocate here is pretty small. Are you sure there is no hardware failure on the machine? What is the free RAM on the box?

On Mon, May 11, 2015 at 3:28 PM, Rahul Bhardwaj rahul.bhard...@indiamart.com wrote:

Hi All,

We have a cluster of 3 nodes with 64GB RAM each. The cluster was running in a healthy state. Suddenly one machine's cassandra daemon stopped working and shut down. On restarting it, after 2 minutes it stops again, returning the below error in cassandra.log:

    Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x7fd064dc6000, 12288, 0) failed;
    error='Cannot allocate memory' (errno=12)
    #
    # There is insufficient memory for the Java Runtime Environment to continue.
    # Native memory allocation (malloc) failed to allocate 12288 bytes for committing reserved memory.
    # An error report file with more information is saved as:
    # /tmp/hs_err_pid23215.log

    INFO 09:50:41 Loading settings from file:/etc/cassandra/default.conf/cassandra.yaml
    INFO 09:50:41 Node configuration:[authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_snapshot=true; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; cas_contention_timeout_in_ms=1000; client_encryption_options=REDACTED; cluster_name=Test Cluster; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_directory=/var/lib/cassandra/commitlog; commitlog_segment_size_in_mb=64; commitlog_sync=periodic; commitlog_sync_period_in_ms=1; compaction_throughput_mb_per_sec=16; concurrent_compactors=4; concurrent_counter_writes=32; concurrent_reads=32; concurrent_writes=32; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; cross_node_timeout=false; data_file_directories=[/var/lib/cassandra/data]; disk_failure_policy=stop; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=60; dynamic_snitch_update_interval_in_ms=100; endpoint_snitch=GossipingPropertyFileSnitch;
hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; incremental_backups=false; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; internode_compression=all; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=null; max_hint_window_in_ms=1080; max_hints_delivery_threads=2; memtable_allocation_type=heap_buffers; native_transport_port=9042; num_tokens=256; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_validity_in_ms=2000; range_request_timeout_in_ms=100; read_request_timeout_in_ms=9; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_timeout_in_ms=9; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=null; rpc_keepalive=true; rpc_port=9160; rpc_server_type=sync; saved_caches_directory=/var/lib/cassandra/saved_caches; seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, parameters=[{seeds=206.191.151.199}]}]; server_encryption_options=REDACTED; snapshot_before_compaction=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=true; storage_port=7000; thrift_framed_transport_size_in_mb=15; tombstone_failure_threshold=10; tombstone_warn_threshold=1000; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=6; write_request_timeout_in_ms=9] ERROR 09:50:41 Exception encountered during startup java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) ~[na:1.7.0_60] at java.lang.Thread.start(Thread.java:714) ~[na:1.7.0_60] at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949) ~[na:1.7.0_60] at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1590) ~[na:1.7.0_60] at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:333) ~[na:1.7.0_60] at 
java.util.concurrent.ScheduledThreadPoolExecutor.scheduleWithFixedDelay(ScheduledThreadPoolExecutor.java:594) ~[na:1.7.0_60] at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor.scheduleWithFixedDelay(DebuggableScheduledThreadPoolExecutor.java:61) ~[apache-cassandra-2.1.2.jar:2.1.2-SNAPSHOT] at org.apache.cassandra.gms.Gossiper.start(Gossiper.java:1188) ~[apache-cassandra-2.1.2.jar:2.1.2-SNAPSHOT] at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:721) ~[apache-cassandra-2.1.2.jar:2.1.2-SNAPSHOT] at org.apache.cassandra.service.StorageService.initServer(StorageService.java:643) ~[apache-cassandra-2.1.2.jar:2.1.2-SNAPSHOT] at org.apache.cassandra.service.StorageService.initServer(StorageService.java:535) ~[apache-cassandra-2.1.2.jar:2.1.2-SNAPSHOT]
Re: error='Cannot allocate memory' (errno=12)
Well, I haven't used 2.1.x Cassandra or Java 8, but is there any reason for not using the Oracle JDK? I thought that's what is recommended. I saw a thread earlier stating that Java 8 with Cassandra 2.0.14+ is tested, but I am not sure about the 2.1.x versions. On Mon, May 11, 2015 at 4:04 PM, Rahul Bhardwaj rahul.bhard...@indiamart.com wrote: PFA the error log hs_err_pid9656.log https://docs.google.com/a/indiamart.com/file/d/0B0hlSlesIPVfaU9peGwxSXdsZGc/edit?usp=drive_web On Mon, May 11, 2015 at 3:58 PM, Rahul Bhardwaj rahul.bhard...@indiamart.com wrote: free RAM:
free -m
             total   used   free   shared  buffers  cached
Mem:         64398  23753  40644       0     108    8324
-/+ buffers/cache: 15319  49078
Swap:         2925     15   2909
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515041
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 515041
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Also attaching the complete error file. On Mon, May 11, 2015 at 3:35 PM, Anishek Agarwal anis...@gmail.com wrote: The memory Cassandra is trying to allocate is pretty small. Are you sure there is no hardware failure on the machine? What is the free RAM on the box? On Mon, May 11, 2015 at 3:28 PM, Rahul Bhardwaj rahul.bhard...@indiamart.com wrote: Hi All, We have a cluster of 3 nodes with 64GB RAM each. The cluster was running in a healthy state. Suddenly one machine's Cassandra daemon stopped working and shut down.
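One thing worth noting about the startup failure above: "java.lang.OutOfMemoryError: unable to create new native thread" with roughly 40 GB of free RAM usually points at OS per-user limits rather than physical memory, and the ulimit output earlier in the thread shows "open files" at 1024, which is low for Cassandra. A minimal sketch for checking the limits a JVM started from this shell would inherit (assuming Linux; the thresholds here are illustrative, not official recommendations):

```python
import resource

def check_limits(min_nofile=100000, min_nproc=32768):
    """Return warnings for resource limits commonly too low for Cassandra.

    The minimums are illustrative assumptions, not official numbers."""
    warnings = []
    # Soft limit on open file descriptors (sstables, sockets, etc.)
    soft_nofile, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_nofile != resource.RLIM_INFINITY and soft_nofile < min_nofile:
        warnings.append("open files limit is %d; consider raising it" % soft_nofile)
    # Soft limit on user processes/threads ("unable to create new native thread")
    if hasattr(resource, "RLIMIT_NPROC"):  # not available on every platform
        soft_nproc, _ = resource.getrlimit(resource.RLIMIT_NPROC)
        if soft_nproc != resource.RLIM_INFINITY and soft_nproc < min_nproc:
            warnings.append("max user processes is %d; consider raising it" % soft_nproc)
    return warnings

print(check_limits())
```

Raising these typically means editing /etc/security/limits.conf (or the init script) for the user Cassandra runs as, then restarting the service from a fresh login.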
Reads failing at around 4000 QPS
Hello everyone, I have a 3-node cluster with Cassandra 2.0.14 on CentOS, in the same data center, with RF=3, and I am using CL=LOCAL_QUORUM by default for read and write operations. I have given about 5 GB of heap space to Cassandra. I have 40-core machines with 3 separate SATA disks, with the commitlog on one and the data directories on the other two. I am doing reads + writes at the same time at about 4000 QPS. I am getting read failures on the client where one of the replicas did not respond. When I look at the Cassandra logs I see a lot of failures, as attached (read_failures.txt). Am I overloading the system too much? 4000 QPS doesn't seem like much at first glance. Please let me know if any other details are required. Data model: partition_key, clustering_key, col1 Regards, Anishek
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,135 MessagingService.java (line 875) 1482 READ messages dropped in last 5000ms
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,136 StatusLogger.java (line 55) Pool NameActive Pending Completed Blocked All Time Blocked
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,136 StatusLogger.java (line 70) ReadStage32 46762438994 0 0
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,137 StatusLogger.java (line 70) RequestResponseStage 0 04624954 0 0
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,137 StatusLogger.java (line 70) ReadRepairStage 0 1 162868 0 0
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,137 StatusLogger.java (line 70) MutationStage 0 03794927 0 0
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,137 StatusLogger.java (line 70) ReplicateOnWriteStage 0 0 0 0 0
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,138 StatusLogger.java (line 70) GossipStage 0 0 5339 0 0
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,138 StatusLogger.java (line 70) CacheCleanupExecutor 0 0 0 0 0
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,138 StatusLogger.java (line 70) MigrationStage0 0 0 0 0
INFO [ScheduledTasks:1] 2015-05-12 15:30:01,138 StatusLogger.java (line 70) MemoryMeter 0 0 38 0 0
INFO 
[ScheduledTasks:1] 2015-05-12 15:30:01,139 StatusLogger.java (line 70) ValidationExecutor0 0 0 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,139 StatusLogger.java (line 70) FlushWriter 0 0 31 0 9 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,139 StatusLogger.java (line 70) InternalResponseStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,139 StatusLogger.java (line 70) AntiEntropyStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,139 StatusLogger.java (line 70) MemtablePostFlusher 0 0 64 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,140 StatusLogger.java (line 70) MiscStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,140 StatusLogger.java (line 70) PendingRangeCalculator0 0 3 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,140 StatusLogger.java (line 70) commitlog_archiver0 0 0 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,140 StatusLogger.java (line 70) CompactionExecutor0 0 1149 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,141 StatusLogger.java (line 70) HintedHandoff 0 1 4 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,141 StatusLogger.java (line 79) CompactionManager 0 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,141 StatusLogger.java (line 81) Commitlog n/a 0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,141 StatusLogger.java (line 93) MessagingServicen/a 0/0 INFO [ScheduledTasks:1] 2015-05-12 15:30:01,141 StatusLogger.java (line 103) Cache Type Size Capacity KeysToSave INFO [ScheduledTasks:1] 2015-05-12 15:30:01,141 StatusLogger.java (line 105) KeyCache 59330812104857600 all INFO
text partition key Bloom filters fp is 1 always, why?
Hello, I have a text partition key for one of the CFs. cfstats on that table seems to show that the bloom filter false positive ratio is always 1. Also, the bloom filter is using very little space. Do bloom filters not work well with text partition keys? I assume this might be because the filter has no way to account for the varying length of the text, and hence would have a very high false positive rate. The text partition key is built as long + "_" + epoch_time_in_hours; would it be better to have a composite partition key of (long, epoch_time_in_hours) rather than combining them into a text key? Thanks anishek
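For reference, the two schema shapes being compared could look like the following (table and column names are made up for illustration; only the key structure matters):

```sql
-- current style: one text key packing "long_epochHours"
CREATE TABLE events_text (
    pkey text,          -- e.g. '12345_398761'
    ts timestamp,
    payload text,
    PRIMARY KEY (pkey, ts)
);

-- alternative: composite partition key with typed components
CREATE TABLE events_composite (
    id bigint,
    epoch_hour bigint,
    ts timestamp,
    payload text,
    PRIMARY KEY ((id, epoch_hour), ts)
);
```

Either way, the partition key is hashed as a whole before the bloom filter ever sees it, so key length and type by themselves should not drive the false-positive ratio; the composite form is mainly cleaner to query and avoids string assembly on the client.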
SSTables per read in cfhistograms
Hello, I am seeing that even though the bloom filter fp ratio is set to 0.1, the actual ratio is at about 0.55, and on looking at the histograms of the table I see that there are reads going to 3+ SSTables, even though, given the way I am querying, a read should look at the most recent row only, since I have time as part of my partition key. I have a composite partition key of ((long, timestamp)). Question: would the number of SSTables read also include those where the bloom filter gave a false positive, or is it just the number of SSTables actually read? Thanks Anishek
Binary Protocol Version and CQL version supported in 2.0.14
Hello, I was trying to find out what protocol versions are supported in Cassandra 2.0.14, and after reading multiple links I am very confused. Please correct me if my understanding is wrong: - The binary protocol version and the CQL spec version are different things? - Cassandra 2.0.x supports CQL 3? - Is there a different binary protocol version between 2.0.x and 2.1.x? Is there some link which states which version of Cassandra supports which binary protocol version and CQL spec version (additionally showing which drivers support what would be great too)? The link http://www.datastax.com/dev/blog/java-driver-2-1-2-native-protocol-v3 shows some info, but I am not sure what the supported protocol versions are referring to (binary or CQL spec). Thanks Anishek
Re: Understanding Read after update
Thanks Tyler for the validations; I have a follow-up question. "One SSTable doesn't have precedence over another. Instead, when the same cell exists in both sstables, the one with the higher write timestamp wins." If my table has 5 non-partition-key columns and I update only 1 of them, then the new SSTable should have only that entry, which means that if I query everything for that partition key, Cassandra has to pick the winning timestamp per column, across SSTables, to get me the data? On Fri, Apr 10, 2015 at 10:52 PM, Tyler Hobbs ty...@datastax.com wrote: "SSTable-level bloom filters have details as to what partition keys are in that table. So to clear up my understanding: if I insert and then have an update to the same row after some time (assuming both go to different SSTables), then during a read Cassandra will read data from both SSTables and merge them in time-series order, with data in the second SSTable for the row taking precedence over the first SSTable, and return the result?" That's approximately correct. The only part that's incorrect is how merging works. One SSTable doesn't have precedence over another. Instead, when the same cell exists in both sstables, the one with the higher write timestamp wins. "Does it mark the old column as a tombstone in the previous SSTable, or wait for compaction to remove the old data?" It just waits for compaction to remove the old data; there's no tombstone. "When the data is in the memtable, does it also keep track of unique keys in that memtable, so when it writes to disk it can use that to derive the right size of the bloom filter for that SSTable?" That's correct, it knows the number of keys before the bloom filter is created. -- Tyler Hobbs DataStax http://datastax.com/
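The per-cell reconciliation Tyler describes can be sketched in a few lines (a toy model in Python, not Cassandra's actual code): each sstable contributes cells for the partition, and for every column the cell with the highest write timestamp wins.

```python
def merge_rows(*sstable_rows):
    """Merge per-column cells for one partition across sstables.

    Each input is a dict: column_name -> (write_timestamp, value).
    For every column, the cell with the highest write timestamp wins --
    no sstable has blanket precedence over another."""
    merged = {}
    for row in sstable_rows:
        for col, (ts, val) in row.items():
            if col not in merged or ts > merged[col][0]:
                merged[col] = (ts, val)
    return {col: val for col, (ts, val) in merged.items()}

# Older sstable holds the full row; a later update touched only column "b".
old = {"a": (100, 1), "b": (100, 2), "c": (100, 3)}
new = {"b": (200, 99)}
print(merge_rows(old, new))  # {'a': 1, 'b': 99, 'c': 3}
```

This is why a query for all columns of the partition still has to consult both sstables: the newer one only knows about column "b".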
Re: PHP Cassandra Driver for 2.0.13
Hey Alex, We are planning on using Cassandra 2.0.13, and it looks like it will take us a month to go to production. The team that needs PHP is only going to read, and we don't think there is too much integration testing or other work we need to do with the PHP driver, so if we get a production PHP driver in 3 weeks I think we should be fine. I still have to discuss this with the other team, though; they might not be willing to wait that long. thanks On Sat, Apr 11, 2015 at 12:52 AM, Alex Popescu al...@datastax.com wrote: What Cassandra version are you using? How soon will you need a production-ready PHP driver? On Fri, Apr 10, 2015 at 5:47 AM, Anishek Agarwal anis...@gmail.com wrote: Hello, As part of using this for our project, one of our teams needs a PHP driver for Cassandra. The DataStax page says it's in ALPHA; is there some release candidate that people have used, or any way to get this working with PHP? Thanks Anishek -- Bests, Alex Popescu | @al3xandru Sen. Product Manager @ DataStax
Re: PHP Cassandra Driver for 2.0.13
The PHP team is very stringent about response times. I will see if we can do a Node.js web service or some form of inter-process communication setup between PHP and Python to achieve this. Thanks for the idea. On Fri, Apr 10, 2015 at 7:13 PM, Michael Dykman mdyk...@gmail.com wrote: Somewhat over a year ago, I set out to address the exact same issue for our high-traffic PHP site. After several failed attempts, I tried to wrap the C++ driver (as it was then) in extern "C" wrappers, before giving up when I realized the driver was pre-alpha. The current implementation provides C bindings out of the box, but its relative immaturity still makes it look like too much of a risk. Ultimately, we set up a web service (JSON in/JSON out) written in Java which uses the DataStax Java driver to accommodate our PHP code's Cassandra needs. An arbitrary number of parameterized queries can be passed to the service, which runs those queries in parallel, and the result is both reliable and very fast. I don't think it would be easy (or even possible) for a PHP implementation to take advantage of the async interface, which is where most of the performance gain is to be had. On Fri, Apr 10, 2015 at 8:47 AM, Anishek Agarwal anis...@gmail.com wrote: Hello, As part of using this for our project, one of our teams needs a PHP driver for Cassandra. The DataStax page says it's in ALPHA; is there some release candidate that people have used, or any way to get this working with PHP? Thanks Anishek -- - michael dykman - mdyk...@gmail.com May the Source be with you.
Re: Heap memory usage while writing
I do understand how MaxTenuringThreshold works, thanks for your evaluation though. I don't think you saw my complete post with the values I have used for the heap size and *memtable_total_space_in_mb=2048*, which is less than half of the young generation space I am using. Additionally, *memtable_flush_queue_size=1*, so there are not many memtables in memory; this, coupled with the fact that I am writing out to Cassandra with 20 threads, means it should pretty much just collect the objects in ParNewGC, *which is what it is doing now.* There were only 2 CMS collections in 15 mins for me when running at full capacity; what I am now concerned about is that the CMS remark phase is about 70 ms, and that is something I am looking to bring down. There still seem to be valuable pointers in *CASSANDRA-8150* which I am going to try. On Fri, Apr 10, 2015 at 7:26 PM, ssiv...@gmail.com ssiv...@gmail.com wrote: "MaxTenuringThreshold is low as I think most of the objects should be ephemeral with only writes." You don't understand how *MaxTenuringThreshold* works. If you keep it low, then GC will move objects which are still alive to the old gen space. Yes, they are ephemeral, but C* will keep them until they are flushed to disk. So, again, you should balance *heap space*, *memtable_total_space_in_mb*, *memtable_cleanup_threshold* and your *disk throughput* to get rid of memtables as soon as possible. If *memtable_total_space_in_mb* is large and the young gen is large too, then you have to increase MaxTenuringThreshold to keep CMS from moving data to the old gen. If you are sure that the young gen is not filled so fast, then you can increase *CMSWaitDuration* to avoid useless calls of CMS. On 04/10/2015 03:42 PM, Anishek Agarwal wrote: Sorry, I forgot to update, but I am not using CMSIncrementalMode anymore, as it overrides UseCMSInitiatingOccupancyOnly. @Graham: thanks for CMSParallelInitialMarkEnabled and CMSEdenChunksRecordAlways; I haven't used them, I will try them.
My initial mark is only around 6ms though. With my current config (incorporating the changes above), I have been able to reduce the number of CMS runs significantly; mostly ParNewGC is running, but when CMS triggers it takes a lot of time for remark, hence I started using -XX:+CMSParallelRemarkEnabled, which gave some improvement. This is still around 70 ms. MaxTenuringThreshold is low as I think most of the objects should be ephemeral with only writes. @Sebastian: I started from that issue :), though I haven't tried the GC-affinity ones yet. Thanks for the link! Thanks anishek On Fri, Apr 10, 2015 at 5:49 PM, Sebastian Estevez sebastian.este...@datastax.com wrote: Did you check out CASSANDRA-8150? On Apr 10, 2015 7:04 AM, Anishek Agarwal anis...@gmail.com wrote: Hey, Any reason you think the MaxTenuringThreshold should be increased? I am pumping data at the full capacity a single node seems to take, so all the data becomes stale soon enough (when it's flushed); additionally, the whole memtable can be in the young generation only. There seems to be enough additional space to even hold the bloom filters for the respective SSTables, I would guess. I will try with the CMSWaitDuration; that should help in reducing the CMS initial mark phase, I think. Though I am not sure what is getting moved to the old generation continuously to fill it? Thanks for the pointers. On Fri, Apr 10, 2015 at 12:12 PM, ssiv...@gmail.com ssiv...@gmail.com wrote: Hi, You should increase *MaxTenuringThreshold* and *CMSWaitDuration* to keep your data in the young generation longer (until the data is flushed to disk). Depending on your load, combine values of the following parameters: *HEAP_NEWSIZE*, *memtable_total_space_in_mb*, *memtable_cleanup_threshold* and your *disk throughput*. Ideally, only ParNewGC will run to collect ephemeral objects, and it will take very short delays.
On 04/09/2015 09:30 AM, Anishek Agarwal wrote: [quoted schema and configuration trimmed; identical to the original "Heap memory usage while writing" post below]
Heap memory usage while writing
Hello, We have only one CF:
CREATE TABLE t1(id bigint, ts timestamp, definition text, primary key (id, ts)) with clustering order by (ts desc) and gc_grace_seconds=0 and compaction = {'class': 'DateTieredCompactionStrategy', 'timestamp_resolution':'SECONDS', 'base_time_seconds':'20', 'max_sstable_age_days':'30'} and compression={'sstable_compression' : ''};
on a single node, using the following in cassandra.yaml:
memtable_total_space_in_mb: 2048
commitlog_total_space_in_mb: 4096
memtable_flush_writers: 2
memtable_flush_queue_size: 1
and in cassandra-env.sh:
MAX_HEAP_SIZE=8G
HEAP_NEWSIZE=5120M
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=6
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=1
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:MaxPermSize=256m
JVM_OPTS=$JVM_OPTS -XX:+AggressiveOpts
JVM_OPTS=$JVM_OPTS -XX:+UseCompressedOops
JVM_OPTS=$JVM_OPTS -XX:+CMSIncrementalMode
JVM_OPTS=$JVM_OPTS -XX:+CMSIncrementalPacing
JVM_OPTS=$JVM_OPTS -XX:+PrintGCDetails
JVM_OPTS=$JVM_OPTS -XX:+PrintGCTimeStamps -verbose:gc
JVM_OPTS=$JVM_OPTS -Xloggc:/home/anishek/apache-cassandra-2.0.13/logs/gc.log
JVM_OPTS=$JVM_OPTS -XX:+PrintHeapAtGC
JVM_OPTS=$JVM_OPTS -XX:+PrintTenuringDistribution
I am writing continuously to the table above via 20 threads. I see that some data keeps moving from the young generation to the old generation continuously, and I am wondering why this is happening. Given that I am writing constantly and my young generation is more than twice the max memtable space used, I would think that only the young generation space would be used and nothing would ever go to the old generation.
** system.log shows no compactions happening.
** There are no read operations.
** Cassandra version 2.0.13 on CentOS with 16 cores and 16 GB RAM.
Thanks Anishek
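As a sanity check, the sizing in this post can be plugged in directly (a back-of-the-envelope sketch; the eden fraction assumes the standard HotSpot layout where SurvivorRatio=6 gives eden 6/8 of the new generation):

```python
MAX_HEAP_MB = 8 * 1024   # MAX_HEAP_SIZE=8G
NEW_GEN_MB = 5120        # HEAP_NEWSIZE=5120M
SURVIVOR_RATIO = 6       # -XX:SurvivorRatio=6
MEMTABLE_MB = 2048       # memtable_total_space_in_mb

old_gen_mb = MAX_HEAP_MB - NEW_GEN_MB
# eden = ratio/(ratio+2) of new gen; the other 2 parts are the survivor spaces
eden_mb = NEW_GEN_MB * SURVIVOR_RATIO / (SURVIVOR_RATIO + 2)
print("old gen: %d MB, eden: %d MB" % (old_gen_mb, eden_mb))
print("memtable space fits in eden:", MEMTABLE_MB < eden_mb)
```

So the memtable space (2048 MB) does fit comfortably in eden (3840 MB); note, though, that with MaxTenuringThreshold=1 any object that survives a single young collection is promoted, which is one plausible source of the steady old-gen growth being asked about.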
Re: log all the query statement
Hey Peter, This is from the perspective of 2.0.13, but there should be something similar in your version. Can you enable the debug log for Cassandra and see if the log files have additional info? Depending on how soon or late in your test you get the error, you might also want to modify maxBackupIndex or maxFileSize to make sure you keep enough log files around. anishek On Thu, Apr 2, 2015 at 11:53 AM, 鄢来琼 laiqiong@gtafe.com wrote: Hi all, Cassandra 2.1.2 is used in my project, but a node goes down after executing some query statements. Could I configure Cassandra to log all executed statements? I hope the log file can be used to identify the problem. Thanks. Peter
Re: Throttle Heavy Read / Write Loads
Maybe just increase the read and write timeouts in Cassandra; they are currently at 5 sec, I think. I believe the DataStax Java client driver provides the ability to set how many max requests per connection may be sent; you can try lowering that to limit excessive requests, along with limiting the number of connections a client can open. Just out of curiosity, how long are the GC pauses for you, both ParNew and CMS, and at what intervals are you seeing the GC happening? I just recently spent time tuning this, and it would be good to know if it's working well. thanks anishek On Fri, Jun 5, 2015 at 12:03 AM, Anuj Wadehra anujw_2...@yahoo.co.in wrote: We are using Cassandra 2.0.14 with Hector as the client (we will be gradually moving to the CQL driver). Often we see that heavy read and write loads lead to Cassandra timeouts and unpredictable results due to GC pauses and request timeouts. We need to know the best way to throttle read and write load on Cassandra such that even if heavy operations are slower, they complete gracefully. This will also shield us against misbehaving clients. I was thinking of limiting RPC connections via the rpc_max_threads property and implementing a connection pool on the client side. I would appreciate it if you could share your suggestions on the above approach, or any alternatives. Thanks Anuj Wadehra
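Independent of which driver is used (Hector's and the CQL drivers' pooling knobs vary by version), the client-side throttling being discussed comes down to capping in-flight requests. A minimal, driver-agnostic sketch of the idea in Python:

```python
import threading

class Throttle:
    """Cap the number of requests in flight from this client.

    A generic sketch, not tied to any particular driver API: callers
    block when max_in_flight requests are already executing, which
    smooths bursts instead of dumping them on the cluster."""
    def __init__(self, max_in_flight):
        self._sem = threading.BoundedSemaphore(max_in_flight)

    def execute(self, fn, *args):
        with self._sem:   # acquire a slot; released automatically on return
            return fn(*args)

throttle = Throttle(max_in_flight=128)
# In real use, fn would be the driver call that issues the query.
result = throttle.execute(lambda x: x * 2, 21)
print(result)  # 42
```

Where the driver itself exposes limits (max requests per connection, connection pool size), those are usually the better first lever, since they also bound what a misbehaving client can do.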
DTCS - nodetool repair - TTL
Hello all, We are running C* version 2.0.15. We have 5 nodes with RF=3. We are using DTCS, and on all inserts we have a TTL of 30 days. We have no deletes. We have just one CF. When I run nodetool repair on a node I notice a lot of extra sstables created; this, I think, is because it is reconciling the correct values across the different nodes. What I am trying to figure out now is how this will affect performance after the TTL is reached for rows. As far as I understood from Spotify's DTCS post https://labs.spotify.com/tag/dtcs/, it looks like DTCS will drop a whole sstable once the TTL is reached, as it compacts data inserted around the same time into the same sstable. Now, when repair happens, we have these new sstables which are earlier in the timeline and hence will have tombstones alive for some time. For example, if the machine is up for 2 weeks and I run repair now for the first time, then the new sstables might have data from anywhere in the previous weeks; so even though the sstables created during week 1 will get dropped at the start of the 5th week, because of repair there will be additional sstables which will keep tombstones until they reach their eventual drop state a few weeks later. Am I thinking about this correctly? Does this mean that we might still have a lot of tombstones lying around, as compaction is less frequent for older tables? thanks anishek
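The whole-sstable drop the Spotify post describes can be modeled in a few lines. This toy Python sketch (not Cassandra code, and simplified to cells with uniform TTLs) shows why a repair-created sstable that mixes old and newer cells lingers until its newest cell expires:

```python
def fully_expired(cells, now, gc_grace_seconds=0):
    """Toy check of when a whole sstable can be dropped: every cell's
    TTL (plus gc_grace) must have elapsed. Mirrors the idea that DTCS
    can unlink an entire sstable once all of its data has expired.

    cells: list of (write_timestamp, ttl) pairs, in seconds."""
    return all(write_ts + ttl + gc_grace_seconds <= now
               for write_ts, ttl in cells)

day = 86400
# A repair-built sstable mixing week-1 data with a cell written on day 3;
# every insert carries the 30-day TTL from the original post.
repaired = [(0 * day, 30 * day), (3 * day, 30 * day)]
print(fully_expired(repaired, now=31 * day))  # False: newest cell lives to day 33
print(fully_expired(repaired, now=34 * day))  # True: whole sstable is droppable
```

So the expired cells in the repaired sstable do hang around, but only until the newest cell in that same sstable also expires, rather than for an extra full compaction cycle.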
Re: handling down node cassandra 2.0.15
Nope, it's not. On Mon, Nov 16, 2015 at 5:48 PM, sai krishnam raju potturi < pskraj...@gmail.com> wrote: > Is that a seed node? > > On Mon, Nov 16, 2015, 05:21 Anishek Agarwal <anis...@gmail.com> wrote: > >> Hello, >> >> We are having a 3 node cluster and one of the node went down due to a >> hardware memory failure looks like. We followed the steps below after the >> node was down for more than the default value of *max_hint_window_in_ms* >> >> I tried to restart cassandra by following the steps @ >> >> >>1. >> >> http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_replace_node_t.html >>2. >> >> http://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html >> >> *except the "clear data" part as it was not specified in second blog >> above.* >> >> i was trying to restart the same node that went down, however I did not >> get the messages in log files as stated in 2 against "StorageService" >> >> instead it just tried to replay and then stopped with the error message >> as below: >> >> *ERROR [main] 2015-11-16 15:27:22,944 CassandraDaemon.java (line 584) >> Exception encountered during startup* >> *java.lang.RuntimeException: Cannot replace address with a node that is >> already bootstrapped* >> >> Can someone please help me if there is something i am doing wrong here. >> >> Thanks for the help in advance. >> >> Regards, >> Anishek >> >
Re: handling down node cassandra 2.0.15
Hey Josh, I did set the replace address, which was the same as the address of the machine that went down, so it was in place. anishek On Mon, Nov 16, 2015 at 10:33 PM, Josh Smith <josh.sm...@careerbuilder.com> wrote: > Did you set the JVM_OPTS to replace address? That is usually the error I > get when I forget to set the replace_address on Cassandra-env. > > > > JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=address_of_dead_node" > > > > > > *From:* Anishek Agarwal [mailto:anis...@gmail.com] > *Sent:* Monday, November 16, 2015 9:25 AM > *To:* user@cassandra.apache.org > *Subject:* Re: handling down node cassandra 2.0.15 > > > > nope its not > > > > On Mon, Nov 16, 2015 at 5:48 PM, sai krishnam raju potturi < > pskraj...@gmail.com> wrote: > > Is that a seed node? > > > > On Mon, Nov 16, 2015, 05:21 Anishek Agarwal <anis...@gmail.com> wrote: > > Hello, > > > > We are having a 3 node cluster and one of the node went down due to a > hardware memory failure looks like. We followed the steps below after the > node was down for more than the default value of *max_hint_window_in_ms* > > > > I tried to restart cassandra by following the steps @ > > > >1. > > http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_replace_node_t.html >2. > > http://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html > > *except the "clear data" part as it was not specified in second blog > above.* > > > > i was trying to restart the same node that went down, however I did not > get the messages in log files as stated in 2 against "StorageService" > > > > instead it just tried to replay and then stopped with the error message as > below: > > > > *ERROR [main] 2015-11-16 15:27:22,944 CassandraDaemon.java (line 584) > Exception encountered during startup* > > *java.lang.RuntimeException: Cannot replace address with a node that is > already bootstrapped* > > > > Can someone please help me if there is something i am doing wrong here. 
> > > > Thanks for the help in advance. > > > > Regards, > > Anishek > > >
Re: handling down node cassandra 2.0.15
Hey Anuj, OK, I will try that next time. So you are saying that since I am replacing the machine in place (trying to get the same machine back into the cluster), and it already has some data, I don't clean the commitlog/data directories; I set auto_bootstrap = false, restart the node, and then run repair on this machine, right? thanks anishek On Mon, Nov 16, 2015 at 11:40 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote: > Hi Anishek, > > In my opinion, you already have data and bootstrapping is not needed here. > You can set auto_bootstrap to false in Cassandra.yaml and once the > cassandra is rebooted, you should run repair to fix the inconsistent data. > > > Thanks > Anuj > > > > On Monday, 16 November 2015 10:34 PM, Josh Smith < > josh.sm...@careerbuilder.com> wrote: > > > Did you set the JVM_OPTS to replace address? That is usually the error I > get when I forget to set the replace_address on Cassandra-env. > > JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=address_of_dead_node" > > > *From:* Anishek Agarwal [mailto:anis...@gmail.com] > *Sent:* Monday, November 16, 2015 9:25 AM > *To:* user@cassandra.apache.org > *Subject:* Re: handling down node cassandra 2.0.15 > > nope its not > > On Mon, Nov 16, 2015 at 5:48 PM, sai krishnam raju potturi < > pskraj...@gmail.com> wrote: > > Is that a seed node? > > On Mon, Nov 16, 2015, 05:21 Anishek Agarwal <anis...@gmail.com> wrote: > > Hello, > > We are having a 3 node cluster and one of the node went down due to a > hardware memory failure looks like. We followed the steps below after the > node was down for more than the default value of *max_hint_window_in_ms* > > I tried to restart cassandra by following the steps @ > > >1. > > http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_replace_node_t.html >2. 
> > http://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html > > *except the "clear data" part as it was not specified in second blog > above.* > > i was trying to restart the same node that went down, however I did not > get the messages in log files as stated in 2 against "StorageService" > > instead it just tried to replay and then stopped with the error message as > below: > > *ERROR [main] 2015-11-16 15:27:22,944 CassandraDaemon.java (line 584) > Exception encountered during startup* > *java.lang.RuntimeException: Cannot replace address with a node that is > already bootstrapped* > > Can someone please help me if there is something i am doing wrong here. > > Thanks for the help in advance. > > Regards, > Anishek > > > > >
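A minimal sketch of the sequence Anuj describes, for an in-place replacement whose data directories are intact. The service name and the dry-run wrapper are assumptions of mine, not part of the thread; unset DRY_RUN to actually execute.

```shell
#!/bin/sh
# Sketch: bring an in-place replacement node (data intact) back into the
# ring without bootstrapping. Service name is a placeholder.
run() { echo "+ $*"; if [ -z "$DRY_RUN" ]; then "$@"; fi; }
DRY_RUN=1   # unset to actually execute

# 0) ensure cassandra.yaml contains: auto_bootstrap: false
#    (the option defaults to true when the line is absent)
# 1) start the node with its existing commitlog/data directories
run sudo service cassandra start
# 2) repair to fix whatever writes the node missed while it was down
run nodetool repair
```

Afterwards, set auto_bootstrap back to true (or remove the line) so that a genuinely empty node added later still bootstraps normally.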
handling down node cassandra 2.0.15
Hello, We have a 3 node cluster, and one of the nodes went down, it looks like due to a hardware memory failure. We followed the steps below after the node had been down for more than the default value of *max_hint_window_in_ms* I tried to restart cassandra by following the steps @ 1. http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_replace_node_t.html 2. http://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html *except the "clear data" part, as it was not specified in the second blog above.* I was trying to restart the same node that went down; however, I did not get the messages in the log files as stated in [2] for "StorageService". Instead it just tried to replay and then stopped with the error message below: *ERROR [main] 2015-11-16 15:27:22,944 CassandraDaemon.java (line 584) Exception encountered during startup* *java.lang.RuntimeException: Cannot replace address with a node that is already bootstrapped* Can someone please tell me if there is something I am doing wrong here? Thanks for the help in advance. Regards, Anishek
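For reference, the replace_address flag discussed later in this thread goes into conf/cassandra-env.sh before starting the replacement node; the IP below is a placeholder. Note that the "already bootstrapped" error above is what you see when the node still carries its old system tables: replace_address expects a node starting from empty data directories, which is why the skipped "clear data" step matters.

```shell
# cassandra-env.sh -- sketch only; 10.0.0.12 stands in for the dead
# node's former IP address. Remove this line again once the replacement
# has finished joining, or it will interfere with later restarts.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.12"
```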
Re: terrible read/write latency fluctuation
If it's some sort of time series, DTCS might turn out to be better for compaction. Also, some disk monitoring might help to understand whether disk is the bottleneck. On Sun, Oct 25, 2015 at 3:47 PM, 曹志富 wrote: > I will try to trace a read that takes > 20 msec. > > Just HDD. No deletes, just a 60-day TTL. Value size is small, max length is 140. > > > My data is like a time series: 90% of reads are for data with timestamp < 7 days. > Data is almost insert-only, with a little update. >
Re: compaction with LCS
Has anyone seen similar behavior with LCS? Please do let me know; it will be good to know whether this can happen. On Fri, Oct 9, 2015 at 5:19 PM, Anishek Agarwal <anis...@gmail.com> wrote: > Looks like some of the nodes have higher sstables on L0 and compaction is > running there, so only few nodes run compaction at a time and the > preference is given to lower level nodes for compaction before going to > higher levels ? so is compaction cluster aware then ? > > > On Fri, Oct 9, 2015 at 5:17 PM, Anishek Agarwal <anis...@gmail.com> wrote: > >> hello, >> >> on doing cfstats for the column family i see >> >> SSTables in each level: [1, 10, 109/100, 1, 0, 0, 0, 0, 0] >> >> i thought compaction would trigger since the 3rd level tables are more >> than the expected number, >> >> but on doing compactionstats its shows "n/a" -- any reason why its not >> triggering, should i be worried ? >> >> we have 5 node cluster running 2.0.15 cassandra version, >> >> thanks >> anishek >> > >
Re: compaction with LCS
It looks like some of the nodes have more sstables in L0 and compaction is running there, so only a few nodes run compaction at a time, and preference is given to the lower levels for compaction before going to the higher levels? So is compaction cluster-aware, then? On Fri, Oct 9, 2015 at 5:17 PM, Anishek Agarwal <anis...@gmail.com> wrote: > hello, > > on doing cfstats for the column family i see > > SSTables in each level: [1, 10, 109/100, 1, 0, 0, 0, 0, 0] > > i thought compaction would trigger since the 3rd level tables are more > than the expected number, > > but on doing compactionstats its shows "n/a" -- any reason why its not > triggering, should i be worried ? > > we have 5 node cluster running 2.0.15 cassandra version, > > thanks > anishek >
compaction with LCS
Hello, on doing cfstats for the column family I see SSTables in each level: [1, 10, 109/100, 1, 0, 0, 0, 0, 0] I thought compaction would trigger since the third-level tables are more than the expected number, but on doing compactionstats it shows "n/a" -- any reason why it's not triggering? Should I be worried? We have a 5 node cluster running cassandra version 2.0.15. thanks anishek
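Two things help interpret that cfstats line. First, compaction is purely node-local -- there is no cluster-wide coordination, so each node decides independently from its own sstables. Second, with LCS each level above L0 targets ten times the previous one, so the 109/100 at the third position (level 2, counting from L0) is only ~9% over target, which the compactor may not treat as urgent. A sketch of the per-level ceilings:

```shell
#!/bin/sh
# Sketch: expected max sstable counts per LCS level with the default
# fanout of 10. L0 is special: freshly flushed sstables land there first.
level=1
target=1
while [ "$level" -le 4 ]; do
  target=$((target * 10))
  echo "L$level target: $target sstables"
  level=$((level + 1))
done
```

An "n/a" in compactionstats just means nothing is running at that instant; the small L2 overflow will be worked off when the node next schedules a compaction for that table.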
DTCS dropping of SST Tables
Hey all, We are using DTCS and we have a TTL of 30 days on all inserts; we do no deletes/updates. When an SSTable is dropped by DTCS, what kind of logging do we see in the C* logs? Any help would be useful. The reason I ask is that my DB size is not hovering around a fixed size, it keeps increasing, while there has been no significant change in the traffic that creates data in C*. thanks anishek
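As far as I can tell, on 2.0.x the whole-sstable drop of a fully expired table does not produce a distinctive INFO-level log line, so watching the data directory directly is often easier. A sketch, with a placeholder path:

```shell
#!/bin/sh
# Sketch: flag sstable files older than the 30-day TTL that are still on
# disk. DATA_DIR is a placeholder for the table's data directory.
DATA_DIR="${DATA_DIR:-/var/lib/cassandra/data/mykeyspace/mytable}"
find "$DATA_DIR" -name '*-Data.db' -mtime +30 2>/dev/null |
while read -r f; do
  echo "past TTL but still on disk: $f"
done
```

If old files persist, the usual reason is that a newer sstable overlaps their timestamp range, which blocks the whole-file drop -- consistent with the growth you are seeing.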
Strategy tools for taking snapshots to load in another cluster instance
Hello, We have a 5 node prod cluster and a 3 node test cluster. Is there a way I can take a snapshot of a table in prod and load it into the test cluster? The cassandra versions are the same. Even if there is a tool that can help with this, that would be great. If not, how do people handle scenarios where prod data is required in staging/test clusters for testing, to make sure things are correct? Does the cluster size have to be the same to allow copying of the relevant snapshot data, etc.? thanks anishek
Re: handling down node cassandra 2.0.15
@Rob interesting, something I will try next time. For step 3 you mentioned -- do I just remove the -Dcassandra.join_ring=false option and restart the cassandra service? @Anuj, gc_grace_seconds dictates how long hinted handoffs are stored, right? That might matter where we explicitly delete values from the table; we just have a TTL, and DTCS should delete data older than 1 month. In this case do I need to wipe the node and then start the copy of the keyspace again? Or can I run a repair once it joins the ring with auto_bootstrap=false? On Wed, Nov 18, 2015 at 1:20 AM, Robert Coli wrote: > On Tue, Nov 17, 2015 at 4:33 AM, Anuj Wadehra > wrote: > >> Only if gc_grace_seconds haven't passed since the failure. If your machine >> is down for more than gc_grace_seconds you need to delete the data >> directory and go with auto bootstrap = true . >> > > Since CASSANDRA-6961 you can : > > 1) bring up the node with join_ring=false > 2) repair it > 3) join it to the cluster > > https://issues.apache.org/jira/browse/CASSANDRA-6961 > > This prevents you from decreasing your unique replica count, which is > usually a good thing! > > =Rob >
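On the step-3 question: with CASSANDRA-6961 no restart is needed -- a node started with join_ring=false can be told to join via nodetool. A sketch of the sequence Rob outlines (dry-run by default; the service name is my assumption):

```shell
#!/bin/sh
# Sketch of the CASSANDRA-6961 sequence: up-but-not-joined, repair, join.
run() { echo "+ $*"; if [ -z "$DRY_RUN" ]; then "$@"; fi; }
DRY_RUN=1   # unset to actually execute

# 1) start without joining the ring; first add to cassandra-env.sh:
#      JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
run sudo service cassandra start
# 2) repair while the node is up but not serving reads
run nodetool repair
# 3) join the ring in place -- no restart required
run nodetool join
```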
Re: Strategy tools for taking snapshots to load in another cluster instance
Peer, that talks about having a similar-sized cluster; I was wondering if there is a way to move from a larger to a smaller cluster. I will try a few things as soon as I get time and update here. On Thu, Nov 19, 2015 at 5:48 PM, Peer, Oded <oded.p...@rsa.com> wrote: > Have you read the DataStax documentation? > > > http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_snapshot_restore_new_cluster.html > > > > > > *From:* Romain Hardouin [mailto:romainh...@yahoo.fr] > *Sent:* Wednesday, November 18, 2015 3:59 PM > *To:* user@cassandra.apache.org > *Subject:* Re: Strategy tools for taking snapshots to load in another > cluster instance > > > > You can take a snapshot via nodetool then load sstables on your test > cluster with sstableloader: > docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsBulkloader_t.html > > > > Sent from Yahoo Mail on Android > -- > > *From*:"Anishek Agarwal" <anis...@gmail.com> > *Date*:Wed, Nov 18, 2015 at 11:24 > *Subject*:Strategy tools for taking snapshots to load in another cluster > instance > > Hello > > > > We have 5 node prod cluster and 3 node test cluster. Is there a way i can > take snapshot of a table in prod and load it test cluster. The cassandra > versions are same. > > > > Even if there is a tool that can help with this it will be great. > > > > If not, how do people handle scenarios where data in prod is required in > staging/test clusters for testing to make sure things are correct ? Does > the cluster size have to be same to allow copying of relevant snapshot data > etc? > > > > > > thanks > > anishek > > >
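On the larger-to-smaller question: sstableloader re-streams each row to whichever replicas the *target* cluster's ring assigns it, so a 5-node snapshot can be loaded into a 3-node cluster. A sketch of the snapshot-then-load path with placeholder keyspace, table, host and path names (dry-run by default):

```shell
#!/bin/sh
# Sketch: prod snapshot -> test cluster via sstableloader.
# All names and paths below are placeholders.
run() { echo "+ $*"; if [ -z "$DRY_RUN" ]; then "$@"; fi; }
DRY_RUN=1
KS=mykeyspace
TBL=mytable

# on each prod node: snapshot just this table
run nodetool snapshot -t for_test -cf "$TBL" "$KS"
# stage the snapshot laid out as <keyspace>/<table>/, which is the
# directory structure sstableloader expects
run rsync -a "/var/lib/cassandra/data/$KS/$TBL/snapshots/for_test/" "staging:/tmp/load/$KS/$TBL/"
# on the staging box: stream into the *test* cluster
run sstableloader -d test-node1 "/tmp/load/$KS/$TBL"
```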
Re: High Bloom filter false ratio
Looks like sstablemetadata is available in 2.2; we are on 2.0.x. Do you know of anything that will work on 2.0.x? On Tue, Feb 23, 2016 at 1:48 PM, Anishek Agarwal <anis...@gmail.com> wrote: > Thanks Jeff, Awesome will look at the tools and JMX endpoint. > > our settings are below originated from the jira you posted above as the > base. we are running on 48 core machines with 2 SSD disks of 800 GB each . > > MAX_HEAP_SIZE="6G" > > HEAP_NEWSIZE="4G" > > JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC" > > JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC" > > JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled" > > JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6" > > JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4" > > JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70" > > JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly" > > JVM_OPTS="$JVM_OPTS -XX:+UseTLAB" > > JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m" > > JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts" > > JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops" > > JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark" > > JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48" > > JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48" > > JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent" > > JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions" > > JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity" > > JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs" > > # earlier value 131072 > > JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32678" > > JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600" > > JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32678" > > JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32678" > > > On Tue, Feb 23, 2016 at 1:06 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> > wrote: > >> There exists a JMX endpoint called forceUserDefinedCompaction that takes >> a comma separated list of sstables to compact together. 
>> >> There also exists a tool called sstablemetadata (may be in a >> ‘cassandra-tools’ package separate from whatever package you used to >> install cassandra, or in the tools/ directory of your binary package). >> Using sstablemetadata, you can look at the maxTimestamp for each table, and >> the ‘Estimated droppable tombstones’. Using those two fields, you could, >> very easily, write a script that gives you a list of sstables that you >> could feed to forceUserDefinedCompaction to join together to eliminate >> leftover waste. >> >> Your long ParNew times may be fixable by increasing the new gen size of >> your heap – the general guidance in cassandra-env.sh is out of date, you >> may want to reference CASSANDRA-8150 for “newer” advice ( >> http://issues.apache.org/jira/browse/CASSANDRA-8150 ) >> >> - Jeff >> >> From: Anishek Agarwal >> Reply-To: "user@cassandra.apache.org" >> Date: Monday, February 22, 2016 at 8:33 PM >> >> To: "user@cassandra.apache.org" >> Subject: Re: High Bloom filter false ratio >> >> Hey Jeff, >> >> Thanks for the clarification, I did not explain my self clearly, the >> max_stable_age_days >> is set to 30 days and the ttl on every insert is set to 30 days also >> by default. gc_grace_seconds is 0, so i would think the sstable as a whole >> would be deleted. >> >> Because of the problems mentioned by at 1) above it looks like, there >> might be cases where the table just lies around since no compaction is >> happening on it and even though everything is expired it would still not be >> deleted? 
>> >> for 3) the average read is pretty good, though the throughput doesn't >> seem to be that great, when no repair is running we get GCIns > 200ms every >> couple of hours once, otherwise its every 10-20 mins >> >> INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line >> 116) GC for ParNew: 205 ms for 1 collections, 1712439128 used; max is >> 7784628224 >> >> INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line >> 116) GC for ParNew: 242 ms for 1 collections, 1819126928 used; max is >> 7784628224 >> >> INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line >> 116) GC for ParNew: 374 ms for 1 collections, 1829660304 used; max is >> 7784628224 >> >> INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line >&
Re: High Bloom filter false ratio
Thanks Jeff, awesome, I will look at the tools and the JMX endpoint. Our settings are below, originating from the jira you posted above as the base. We are running on 48 core machines with 2 SSD disks of 800 GB each. MAX_HEAP_SIZE="6G" HEAP_NEWSIZE="4G" JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC" JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC" JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled" JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6" JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4" JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70" JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly" JVM_OPTS="$JVM_OPTS -XX:+UseTLAB" JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m" JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts" JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops" JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark" JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48" JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48" JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent" JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions" JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity" JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs" # earlier value 131072 JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32678" JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600" JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32678" JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32678" On Tue, Feb 23, 2016 at 1:06 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote: > There exists a JMX endpoint called forceUserDefinedCompaction that takes a > comma separated list of sstables to compact together. > > There also exists a tool called sstablemetadata (may be in a > ‘cassandra-tools’ package separate from whatever package you used to > install cassandra, or in the tools/ directory of your binary package). > Using sstablemetadata, you can look at the maxTimestamp for each table, and > the ‘Estimated droppable tombstones’. 
Using those two fields, you could, > very easily, write a script that gives you a list of sstables that you > could feed to forceUserDefinedCompaction to join together to eliminate > leftover waste. > > Your long ParNew times may be fixable by increasing the new gen size of > your heap – the general guidance in cassandra-env.sh is out of date, you > may want to reference CASSANDRA-8150 for “newer” advice ( > http://issues.apache.org/jira/browse/CASSANDRA-8150 ) > > - Jeff > > From: Anishek Agarwal > Reply-To: "user@cassandra.apache.org" > Date: Monday, February 22, 2016 at 8:33 PM > > To: "user@cassandra.apache.org" > Subject: Re: High Bloom filter false ratio > > Hey Jeff, > > Thanks for the clarification, I did not explain my self clearly, the > max_stable_age_days > is set to 30 days and the ttl on every insert is set to 30 days also > by default. gc_grace_seconds is 0, so i would think the sstable as a whole > would be deleted. > > Because of the problems mentioned by at 1) above it looks like, there > might be cases where the table just lies around since no compaction is > happening on it and even though everything is expired it would still not be > deleted? 
> > for 3) the average read is pretty good, though the throughput doesn't seem > to be that great, when no repair is running we get GCIns > 200ms every > couple of hours once, otherwise its every 10-20 mins > > INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line > 116) GC for ParNew: 205 ms for 1 collections, 1712439128 used; max is > 7784628224 > > INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line > 116) GC for ParNew: 242 ms for 1 collections, 1819126928 used; max is > 7784628224 > > INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line > 116) GC for ParNew: 374 ms for 1 collections, 1829660304 used; max is > 7784628224 > > INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line > 116) GC for ParNew: 419 ms for 1 collections, 2309875224 used; max is > 7784628224 > > INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line > 116) GC for ParNew: 231 ms for 1 collections, 2515325328 used; max is > 7784628224 > > INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line > 116) GC for ParNew: 252 ms for 1 collections, 1724241952 used; max is > 7784628224 > > > our reading patterns are dependent on BF to work efficiently as we do a > lot of reads for keys that may not exists because its time series and > we segregate data based on hourly boundary from epoch. > > > hey Christoper, > > yes eve
Re: Cassandra nodes reduce disks per node
perational point of view (very long operation + repair needed) >> >> Hope this long email will be useful, maybe should I blog about this. Let >> me know if the process above makes sense or if some things might be >> improved. >> >> C*heers, >> - >> Alain Rodriguez >> France >> >> The Last Pickle >> http://www.thelastpickle.com >> >> 2016-02-19 7:19 GMT+01:00 Branton Davis <branton.da...@spanning.com>: >> >>> Jan, thanks! That makes perfect sense to run a second time before >>> stopping cassandra. I'll add that in when I do the production cluster. >>> >>> On Fri, Feb 19, 2016 at 12:16 AM, Jan Kesten <j.kes...@enercast.de> >>> wrote: >>> >>>> Hi Branton, >>>> >>>> two cents from me - I didnt look through the script, but for the rsyncs >>>> I do pretty much the same when moving them. Since they are immutable I do a >>>> first sync while everything is up and running to the new location which >>>> runs really long. Meanwhile new ones are created and I sync them again >>>> online, much less files to copy now. After that I shutdown the node and my >>>> last rsync now has to copy only a few files which is quite fast and so the >>>> downtime for that node is within minutes. >>>> >>>> Jan >>>> >>>> >>>> >>>> Von meinem iPhone gesendet >>>> >>>> Am 18.02.2016 um 22:12 schrieb Branton Davis < >>>> branton.da...@spanning.com>: >>>> >>>> Alain, thanks for sharing! I'm confused why you do so many repetitive >>>> rsyncs. Just being cautious or is there another reason? Also, why do you >>>> have --delete-before when you're copying data to a temp (assumed empty) >>>> directory? >>>> >>>> On Thu, Feb 18, 2016 at 4:12 AM, Alain RODRIGUEZ <arodr...@gmail.com> >>>> wrote: >>>> >>>>> I did the process a few weeks ago and ended up writing a runbook and a >>>>> script. I have anonymised and share it fwiw. >>>>> >>>>> https://github.com/arodrime/cassandra-tools/tree/master/remove_disk >>>>> >>>>> It is basic bash. 
I tried to have the shortest down time possible, >>>>> making this a bit more complex, but it allows you to do a lot in parallel >>>>> and just do a fast operation sequentially, reducing overall operation >>>>> time. >>>>> >>>>> This worked fine for me, yet I might have make some errors while >>>>> making it configurable though variables. Be sure to be around if you >>>>> decide >>>>> to run this. Also I automated this more by using knife (Chef), I hate to >>>>> repeat ops, this is something you might want to consider. >>>>> >>>>> Hope this is useful, >>>>> >>>>> C*heers, >>>>> - >>>>> Alain Rodriguez >>>>> France >>>>> >>>>> The Last Pickle >>>>> http://www.thelastpickle.com >>>>> >>>>> 2016-02-18 8:28 GMT+01:00 Anishek Agarwal <anis...@gmail.com>: >>>>> >>>>>> Hey Branton, >>>>>> >>>>>> Please do let us know if you face any problems doing this. >>>>>> >>>>>> Thanks >>>>>> anishek >>>>>> >>>>>> On Thu, Feb 18, 2016 at 3:33 AM, Branton Davis < >>>>>> branton.da...@spanning.com> wrote: >>>>>> >>>>>>> We're about to do the same thing. It shouldn't be necessary to shut >>>>>>> down the entire cluster, right? >>>>>>> >>>>>>> On Wed, Feb 17, 2016 at 12:45 PM, Robert Coli <rc...@eventbrite.com> >>>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Feb 16, 2016 at 11:29 PM, Anishek Agarwal < >>>>>>>> anis...@gmail.com> wrote: >>>>>>>>> >>>>>>>>> To accomplish this can I just copy the data from disk1 to disk2 >>>>>>>>> with in the relevant cassandra home location folders, change the >>>>>>>>> cassanda.yaml configuration and restart the node. before starting i >>>>>>>>> will >>>>>>>>> shutdown the cluster. >>>>>>>>> >>>>>>>> >>>>>>>> Yes. >>>>>>>> >>>>>>>> =Rob >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
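Jan's multi-pass copy above can be sketched as follows; it relies on sstables being immutable, so the live passes are safe. Paths, service name, and the dry-run wrapper are placeholders of mine:

```shell
#!/bin/sh
# Sketch: retire one data directory with minimal downtime via repeated
# rsync passes. SRC/DST are placeholders for entries in
# data_file_directories.
run() { echo "+ $*"; if [ -z "$DRY_RUN" ]; then "$@"; fi; }
DRY_RUN=1
SRC=/data2/cassandra/data/    # directory being retired
DST=/data1/cassandra/data/    # surviving directory

run rsync -a "$SRC" "$DST"    # pass 1: long, node fully live
run rsync -a "$SRC" "$DST"    # pass 2: only newly flushed files
run nodetool drain            # flush memtables, stop accepting writes
run sudo service cassandra stop
run rsync -a "$SRC" "$DST"    # final pass: few files, so minutes not hours
# now remove $SRC from data_file_directories in cassandra.yaml, then:
run sudo service cassandra start
```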
Re: High Bloom filter false ratio
Hey Jeff, Thanks for the clarification. I did not explain myself clearly: max_sstable_age_days is set to 30 days, and the TTL on every insert is also set to 30 days by default. gc_grace_seconds is 0, so I would think the sstable as a whole would be deleted. Because of the problems mentioned at 1) above, it looks like there might be cases where the sstable just lies around, since no compaction is happening on it, and even though everything is expired it would still not be deleted? For 3), the average read is pretty good, though the throughput doesn't seem to be that great. When no repair is running we get GCInspector pauses > 200 ms once every couple of hours, otherwise it's every 10-20 mins: INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line 116) GC for ParNew: 205 ms for 1 collections, 1712439128 used; max is 7784628224 INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line 116) GC for ParNew: 242 ms for 1 collections, 1819126928 used; max is 7784628224 INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line 116) GC for ParNew: 374 ms for 1 collections, 1829660304 used; max is 7784628224 INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line 116) GC for ParNew: 419 ms for 1 collections, 2309875224 used; max is 7784628224 INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line 116) GC for ParNew: 231 ms for 1 collections, 2515325328 used; max is 7784628224 INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line 116) GC for ParNew: 252 ms for 1 collections, 1724241952 used; max is 7784628224 Our reading patterns depend on the BF working efficiently, as we do a lot of reads for keys that may not exist, because it's time series data and we segregate data on hourly boundaries from epoch. Hey Christopher, yes, every row in the sstable that should have been deleted has "d" in that column. 
Also, the key for one of the rows is "key": "00080cdd5edd080006251000" -- how do I get it back into a normal readable format to recover the (long,long) composite partition key? Looks like I have to force a major compaction to delete a lot of data? Are there any other solutions? thanks anishek On Mon, Feb 22, 2016 at 11:21 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote: > 1) getFullyExpiredSSTables in 2.0 isn’t as thorough as many expect, so > it’s very likely that some sstables stick around longer than you expect. > > 2) max_sstable_age_days tells cassandra when to stop compacting that file, > not when to delete it. > > 3) You can change the window size using both the base_time_seconds > parameter and max_sstable_age_days parameter (use the former to set the > size of the first window, and the latter to determine how long before you > stop compacting that window). It’s somewhat non-intuitive. > > Your read latencies actually look pretty reasonable, are you sure you’re > not simply hitting GC pauses that cause your queries to run longer than you > expect? Do you have graphs of GC time (first derivative of total gc time is > common for tools like graphite), or do you see ‘gcinspector’ in your logs > indicating pauses > 200ms? > > From: Anishek Agarwal > Reply-To: "user@cassandra.apache.org" > Date: Sunday, February 21, 2016 at 11:13 PM > To: "user@cassandra.apache.org" > Subject: Re: High Bloom filter false ratio > > Hey guys, > > Just did some more digging ... looks like DTCS is not removing old data > completely, I used sstable2json for one such table and saw old data there. > we have a value of 30 for max_sstable_age_days for the table. > > One of the columns showed data as :["2015-12-10 11\\:03+0530:", > "56690ea2", 1449725602552000, "d"] what is the meaning of "d" in the last > IS_MARKED_FOR_DELETE column ? 
> > I see data from 10 dec 2015 still there, looks like there are a few issues > with DTCS, Operationally what choices do i have to rectify this, We are on > version 2.0.15. > > thanks > anishek > > > > > On Mon, Feb 22, 2016 at 10:23 AM, Anishek Agarwal <anis...@gmail.com> > wrote: > >> We are using DTCS have a 30 day window for them before they are cleaned >> up. I don't think with DTCS we can do anything about table sizing. Please >> do let me know if there are other ideas. >> >> On Sat, Feb 20, 2016 at 12:51 AM, Jaydeep Chovatia < >> chovatia.jayd...@gmail.com> wrote: >> >>> To me following three looks on higher side: >>> SSTable count: 1289 >>> >>> In order to reduce SSTable count see if you are compacting of not (If >>> using STCS). Is it possible to change this to LCS? >>> >>> >>> Number of keys (estimate): 345137664 (345M
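On turning a key like the one quoted above back into its (long,long) parts: CompositeType serializes each component as a 2-byte big-endian length, the component bytes, and a trailing 0x00 end-of-component byte, so two longs occupy 22 bytes. The 12-byte key quoted above therefore looks truncated; the sketch below decodes a made-up but well-formed key encoding the pair (1234, 5678):

```shell
#!/usr/bin/env bash
# Sketch: decode a CompositeType partition key from its hex form.
# Per component: 2-byte big-endian length | value bytes | 0x00 terminator.
# Values >= 2^63 would print negative, since shell math is signed 64-bit.
decode() {
  local key=$1 pos=0 len
  while (( pos < ${#key} )); do
    len=$(( 16#${key:pos:4} ))            # component length in bytes
    echo $(( 16#${key:pos+4:len*2} ))     # big-endian integer value
    pos=$(( pos + 4 + len*2 + 2 ))        # skip length, value, 0x00
  done
}
decode "000800000000000004d2000008000000000000162e00"   # prints 1234, then 5678
```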
Ops Centre Read Requests / TBL: Local Read Requests
Hello, I have installed OpsCenter 5.2.3, along with agents, on the three cassandra nodes in my test cluster, version 2.0.15. This has two tables in one keyspace. I have a program that reads values only from one of the tables (table1) within the keyspace. I am looking at two graphs: - Read Requests across Cluster -- (1) - TBL: Local Reads across Cluster for table1 -- (2) I find that (2) shows higher numbers than (1), almost twice as much. Is there something I am measuring wrong? I would think (1) would always be higher than (2). table1: - has a composite partition key (long,long) - has a single clustering key (text) thanks Anishek
Re: Ops Centre Read Requests / TBL: Local Read Requests
Looks like (1) is analogous to client read requests, so if I do a request at LOCAL_QUORUM consistency level then (2) would be higher, since the coordinator sends two requests out for every single read request it receives. Is there any other possible explanation for the above behaviour? On Mon, Feb 15, 2016 at 4:21 PM, Anishek Agarwal <anis...@gmail.com> wrote: > Hello, > > I have installed Ops center 5.2.3 along with agents on three cassandra > nodes in my test cluster version 2.0.15. This has two tables in one > keyspace. I have a program that is reading values only from one of the > tables(table1) with in a keyspace. > > I am looking at two graphs > >- Read Requests across Cluster -- (1) >- TBL: Local Read across Cluster for table1 -- (2) > > I find that the (2) is having higher numbers than (1) almost twice as > much, is there something i am measuring wrong? i would think (1) would > always be higher than (2) . > > table1 has > >- has a composite partition key (long,long) >- has a single clustering key (text) > > > thanks > Anishek >
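That reading is consistent with the numbers: at RF=3 (an assumption about this keyspace), LOCAL_QUORUM needs floor(RF/2)+1 = 2 replica responses per client read, and both the data read and the digest read count as table-local reads. A sketch of the arithmetic:

```shell
#!/bin/sh
# Sketch: expected table-local reads per client read at LOCAL_QUORUM.
# rf=3 is an assumption, not something stated in the thread.
rf=3
quorum=$(( rf / 2 + 1 ))                  # replicas contacted per read
client_reads=1000
local_reads=$(( client_reads * quorum ))
echo "$quorum replicas per read -> $local_reads local reads per $client_reads client reads"
```

Global read repair (read_repair_chance) can push the ratio slightly above 2x, since it occasionally reads from all replicas.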
Cassandra nodes reduce disks per node
Hello, We started with two 800 GB SSDs on each cassandra node, based on our initial estimates of read/write rates. As we started onboarding additional traffic we found that CPU was becoming the bottleneck and we were not able to run the nice'd background jobs like compaction very well. We have started expanding the cluster, and this will lead to less data per node. It looks like at this point, once we expand the cluster, the current 2 x 800 GB SSDs will be too much and it might be better to have just one SSD. To accomplish this, can I just copy the data from disk1 to disk2 within the relevant cassandra home location folders, change the cassandra.yaml configuration, and restart the node? Before starting I will shut down the cluster. Thanks anishek
Re: Cassandra nodes reduce disks per node
An additional note: we are using cassandra 2.0.15 and have 5 nodes in the cluster, going to expand to 8 nodes. On Wed, Feb 17, 2016 at 12:59 PM, Anishek Agarwal <anis...@gmail.com> wrote: > Hello, > > We started with two 800GB SSD on each cassandra node based on our initial > estimations of read/write rate. As we started on boarding additional > traffic we find that CPU is becoming a bottleneck and we are not able to > run the NICE jobs like compaction very well. We have started expanding the > cluster and this would lead to less data per node. It looks like at this > point once we expand the cluster, the current 2 X 800 GB SSD will be too > much and it might be better to have just one SSD. > > To accomplish this can I just copy the data from disk1 to disk2 with in > the relevant cassandra home location folders, change the cassanda.yaml > configuration and restart the node. before starting i will shutdown the > cluster. > > Thanks > anishek >
Re: Cassandra nodes reduce disks per node
Hey Branton, Please do let us know if you face any problems doing this. Thanks anishek On Thu, Feb 18, 2016 at 3:33 AM, Branton Davis <branton.da...@spanning.com> wrote: > We're about to do the same thing. It shouldn't be necessary to shut down > the entire cluster, right? > > On Wed, Feb 17, 2016 at 12:45 PM, Robert Coli <rc...@eventbrite.com> > wrote: > >> >> >> On Tue, Feb 16, 2016 at 11:29 PM, Anishek Agarwal <anis...@gmail.com> >> wrote: >>> >>> To accomplish this can I just copy the data from disk1 to disk2 with in >>> the relevant cassandra home location folders, change the cassanda.yaml >>> configuration and restart the node. before starting i will shutdown the >>> cluster. >>> >> >> Yes. >> >> =Rob >> >> > >
High Bloom filter false ratio
Hello, We have a table with a composite partition key of humongous cardinality; it's a combination of (long,long). On the table we have bloom_filter_fp_chance=0.01. On doing "nodetool cfstats" on the 5 nodes we have in the cluster, we are seeing "Bloom filter false ratio:" in the range of 0.7-0.9. I thought that over time the bloom filter would adjust to the key space cardinality. We have been running the cluster for a long time now, but have added significant traffic since Jan this year, which would not lead to writes in the db but would lead to high reads to see if there are any values. Are there any settings that can be changed to get a better ratio? Thanks Anishek
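One thing worth knowing: a bloom filter is sized once, per sstable, at write time, purely from bloom_filter_fp_chance and that sstable's key count -- it never adapts afterwards. So the two knobs are lowering fp_chance (at the cost of more off-heap memory) and rewriting existing sstables so they pick the new value up. A sketch with placeholder names, dry-run by default; note that rewriting every sstable is IO-heavy:

```shell
#!/bin/sh
# Sketch: tighten the false-positive target and rebuild existing filters.
# Keyspace/table and the new fp value are placeholders.
run() { echo "+ $*"; if [ -z "$DRY_RUN" ]; then "$@"; fi; }
DRY_RUN=1

run cqlsh -e "ALTER TABLE mykeyspace.mytable WITH bloom_filter_fp_chance = 0.001;"
# filters live inside each sstable; force a rewrite to regenerate them
run nodetool upgradesstables -a mykeyspace mytable
```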
Re: High Bloom filter false ratio
Hey all, @Jaydeep here is the cfstats output from one node. Read Count: 1721134722 Read Latency: 0.04268825050756254 ms. Write Count: 56743880 Write Latency: 0.014650376727851532 ms. Pending Tasks: 0 Table: user_stay_points SSTable count: 1289 Space used (live), bytes: 122141272262 Space used (total), bytes: 224227850870 Off heap memory used (total), bytes: 653827528 SSTable Compression Ratio: 0.4959736121441446 Number of keys (estimate): 345137664 Memtable cell count: 339034 Memtable data size, bytes: 106558314 Memtable switch count: 3266 Local read count: 1721134803 Local read latency: 0.048 ms Local write count: 56743898 Local write latency: 0.018 ms Pending tasks: 0 Bloom filter false positives: 40664437 Bloom filter false ratio: 0.69058 Bloom filter space used, bytes: 493777336 Bloom filter off heap memory used, bytes: 493767024 Index summary off heap memory used, bytes: 91677192 Compression metadata off heap memory used, bytes: 68383312 Compacted partition minimum bytes: 104 Compacted partition maximum bytes: 1629722 Compacted partition mean bytes: 1773 Average live cells per slice (last five minutes): 0.0 Average tombstones per slice (last five minutes): 0.0 @Tyler Hobbs we are using cassandra 2.0.15, so https://issues.apache.org/jira/browse/CASSANDRA-8525 shouldn't occur. The other problems look like they will be fixed in 3.0; we will mostly try to slot in an upgrade to a 3.x version towards the second quarter of this year. @Daemon Latencies seem to have higher ratios; attached is the graph. I am mostly trying to look at bloom filters because of the way we do reads: we read data with non-existent partition keys and it seems to take long to respond -- for example, 720 queries take 2 seconds, with all 720 queries returning nothing. The 720 queries are done in sequences of 180 queries each, with 180 of them running in parallel. 
thanks anishek On Fri, Feb 19, 2016 at 3:09 AM, Jaydeep Chovatia < chovatia.jayd...@gmail.com> wrote: > How many partition keys exists for the table which shows this problem (or > provide nodetool cfstats for that table)? > > On Thu, Feb 18, 2016 at 11:38 AM, daemeon reiydelle <daeme...@gmail.com> > wrote: > >> The bloom filter buckets the values in a small number of buckets. I have >> been surprised by how many cases I see with large cardinality where a few >> values populate a given bloom leaf, resulting in high false positives, and >> a surprising impact on latencies! >> >> Are you seeing 2:1 ranges between mean and worse case latencies (allowing >> for gc times)? >> >> Daemeon Reiydelle >> On Feb 18, 2016 8:57 AM, "Tyler Hobbs" <ty...@datastax.com> wrote: >> >>> You can try slightly lowering the bloom_filter_fp_chance on your table. >>> >>> Otherwise, it's possible that you're repeatedly querying one or two >>> partitions that always trigger a bloom filter false positive. You could >>> try manually tracing a few queries on this table (for non-existent >>> partitions) to see if the bloom filter rejects them. >>> >>> Depending on your Cassandra version, your false positive ratio could be >>> inaccurate: https://issues.apache.org/jira/browse/CASSANDRA-8525 >>> >>> There are also a couple of recent improvements to bloom filters: >>> * https://issues.apache.org/jira/browse/CASSANDRA-8413 >>> * https://issues.apache.org/jira/browse/CASSANDRA-9167 >>> >>> >>> On Thu, Feb 18, 2016 at 1:35 AM, Anishek Agarwal <anis...@gmail.com> >>> wrote: >>> >>>> Hello, >>>> >>>> We have a table with composite partition key with humungous >>>> cardinality, its a combination of (long,long). On the table we have >>>> bloom_filter_fp_chance=0.01. >>>> >>>> On doing "nodetool cfstats" on the 5 nodes we have in the cluster we >>>> are seeing "Bloom filter false ratio:" in the range of 0.7 -0.9. 
>>>> >>>> I thought over time the bloom filter would adjust to the key space >>>> cardinality, we have been running the cluster for a long time now but have >>>> added significant traffic from Jan this year, which would not lead to >>>> writes in the db but would lead to high reads to see if are any values. >>>> >>>> Are there any settings that can be changed to allow better ratio. >>>> >>>> Thanks >>>> Anishek >>>> >>> >>> >>> >>> -- >>> Tyler Hobbs >>> DataStax <http://datastax.com/> >>> >> >
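One mechanism worth checking for this workload: the configured fp_chance applies per sstable, and a read for a non-existent key may consult the filter of every candidate sstable. A hedged back-of-the-envelope sketch using the sstable count from the cfstats above:

```python
def p_any_false_positive(p_per_sstable, sstable_count):
    # Chance that at least one per-sstable bloom filter (each with an
    # independent false-positive rate p) wrongly answers "maybe present".
    return 1 - (1 - p_per_sstable) ** sstable_count

# With fp_chance=0.01 and the 1289 sstables reported above, a read for a
# key that exists nowhere almost always hits at least one sstable:
print(p_any_false_positive(0.01, 1289))  # ~0.999998
# Expected number of sstables touched per such read:
print(0.01 * 1289)                       # ~12.9
```

So even with each filter behaving exactly as configured, a read of a non-existent key across 1289 sstables is effectively guaranteed to do some disk work; reducing sstable count (compaction tuning) attacks this independently of the per-filter fp_chance.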
Re: High Bloom filter false ratio
Hey guys, Just did some more digging ... looks like DTCS is not removing old data completely. I used sstable2json for one such table and saw old data there; we have a value of 30 for max_sstable_age_days on the table. One of the columns showed data as: ["2015-12-10 11\\:03+0530:", "56690ea2", 1449725602552000, "d"]. What is the meaning of "d" in the last IS_MARKED_FOR_DELETE column? I see data from 10 Dec 2015 still there, so it looks like there are a few issues with DTCS. Operationally, what choices do I have to rectify this? We are on version 2.0.15. thanks anishek On Mon, Feb 22, 2016 at 10:23 AM, Anishek Agarwal <anis...@gmail.com> wrote: > We are using DTCS have a 30 day window for them before they are cleaned > up. I don't think with DTCS we can do anything about table sizing. Please > do let me know if there are other ideas. > > On Sat, Feb 20, 2016 at 12:51 AM, Jaydeep Chovatia < > chovatia.jayd...@gmail.com> wrote: > >> To me following three looks on higher side: >> SSTable count: 1289 >> >> In order to reduce SSTable count see if you are compacting of not (If >> using STCS). Is it possible to change this to LCS? >> >> >> Number of keys (estimate): 345137664 (345M partition keys) >> >> I don't have any suggestion about reducing this unless you partition your >> data. >> >> >> Bloom filter space used, bytes: 493777336 (400MB is huge) >> >> If number of keys are reduced then this will automatically reduce bloom >> filter size I believe. >> >> >> >> Jaydeep >> >> On Thu, Feb 18, 2016 at 7:52 PM, Anishek Agarwal <anis...@gmail.com> >> wrote: >> >>> Hey all, >>> >>> @Jaydeep here is the cfstats output from one node. >>> >>> Read Count: 1721134722 >>> >>> Read Latency: 0.04268825050756254 ms. >>> >>> Write Count: 56743880 >>> >>> Write Latency: 0.014650376727851532 ms. 
>>> >>> Pending Tasks: 0 >>> >>> Table: user_stay_points >>> >>> SSTable count: 1289 >>> >>> Space used (live), bytes: 122141272262 >>> >>> Space used (total), bytes: 224227850870 >>> >>> Off heap memory used (total), bytes: 653827528 >>> >>> SSTable Compression Ratio: 0.4959736121441446 >>> >>> Number of keys (estimate): 345137664 >>> >>> Memtable cell count: 339034 >>> >>> Memtable data size, bytes: 106558314 >>> >>> Memtable switch count: 3266 >>> >>> Local read count: 1721134803 >>> >>> Local read latency: 0.048 ms >>> >>> Local write count: 56743898 >>> >>> Local write latency: 0.018 ms >>> >>> Pending tasks: 0 >>> >>> Bloom filter false positives: 40664437 >>> >>> Bloom filter false ratio: 0.69058 >>> >>> Bloom filter space used, bytes: 493777336 >>> >>> Bloom filter off heap memory used, bytes: 493767024 >>> >>> Index summary off heap memory used, bytes: 91677192 >>> >>> Compression metadata off heap memory used, bytes: 68383312 >>> >>> Compacted partition minimum bytes: 104 >>> >>> Compacted partition maximum bytes: 1629722 >>> >>> Compacted partition mean bytes: 1773 >>> >>> Average live cells per slice (last five minutes): 0.0 >>> >>> Average tombstones per slice (last five minutes): 0.0 >>> >>> >>> @Tyler Hobbs >>> >>> we are using cassandra 2.0.15 so >>> https://issues.apache.org/jira/browse/CASSANDRA-8525 shouldnt occur. >>> Other problems looks like will be fixed in 3.0 .. we will mostly try and >>> slot in an upgrade to 3.x version towards second quarter of this year. >>> >>> >>> @Daemon >>> >>> Latencies seem to have higher ratios, attached is the graph. >>> >>> >>> I am mostly trying to look at Bloom filters, because the way we do >>> reads, we read data with non existent partition keys and it seems to be >>> taking long to respond, like for 720 queries it takes 2 seconds, with all >>> 721 queries not returning anything. the 720 queries are done in >>> sequence of 180 queries each with 180 of them running in parallel. 
>>> >>> >>> thanks >>> >>> anishek >>> >>> >>> >>> On Fri, Feb 19, 2016 at 3:09 AM, Jaydeep Chovatia < >>> chovatia.jayd...@gmail
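As an aside on reading the sstable2json output quoted earlier in this thread: the third field of a cell is its write timestamp in microseconds since the Unix epoch. A small sketch decoding the value shown (1449725602552000), which indeed lands on 10 Dec 2015, 11:03 +0530, matching the cell name:

```python
from datetime import datetime, timezone, timedelta

def cell_write_time(micros):
    # sstable2json cell timestamps are microseconds since the Unix epoch.
    # Split into whole seconds + remainder to avoid float rounding.
    return (datetime.fromtimestamp(micros // 10**6, tz=timezone.utc)
            + timedelta(microseconds=micros % 10**6))

ist = timezone(timedelta(hours=5, minutes=30))
print(cell_write_time(1449725602552000).astimezone(ist))
# 2015-12-10 11:03:22.552000+05:30
```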
Re: High Bloom filter false ratio
We are using DTCS have a 30 day window for them before they are cleaned up. I don't think with DTCS we can do anything about table sizing. Please do let me know if there are other ideas. On Sat, Feb 20, 2016 at 12:51 AM, Jaydeep Chovatia < chovatia.jayd...@gmail.com> wrote: > To me following three looks on higher side: > SSTable count: 1289 > > In order to reduce SSTable count see if you are compacting of not (If > using STCS). Is it possible to change this to LCS? > > > Number of keys (estimate): 345137664 (345M partition keys) > > I don't have any suggestion about reducing this unless you partition your > data. > > > Bloom filter space used, bytes: 493777336 (400MB is huge) > > If number of keys are reduced then this will automatically reduce bloom > filter size I believe. > > > > Jaydeep > > On Thu, Feb 18, 2016 at 7:52 PM, Anishek Agarwal <anis...@gmail.com> > wrote: > >> Hey all, >> >> @Jaydeep here is the cfstats output from one node. >> >> Read Count: 1721134722 >> >> Read Latency: 0.04268825050756254 ms. >> >> Write Count: 56743880 >> >> Write Latency: 0.014650376727851532 ms. 
>> >> Pending Tasks: 0 >> >> Table: user_stay_points >> >> SSTable count: 1289 >> >> Space used (live), bytes: 122141272262 >> >> Space used (total), bytes: 224227850870 >> >> Off heap memory used (total), bytes: 653827528 >> >> SSTable Compression Ratio: 0.4959736121441446 >> >> Number of keys (estimate): 345137664 >> >> Memtable cell count: 339034 >> >> Memtable data size, bytes: 106558314 >> >> Memtable switch count: 3266 >> >> Local read count: 1721134803 >> >> Local read latency: 0.048 ms >> >> Local write count: 56743898 >> >> Local write latency: 0.018 ms >> >> Pending tasks: 0 >> >> Bloom filter false positives: 40664437 >> >> Bloom filter false ratio: 0.69058 >> >> Bloom filter space used, bytes: 493777336 >> >> Bloom filter off heap memory used, bytes: 493767024 >> >> Index summary off heap memory used, bytes: 91677192 >> >> Compression metadata off heap memory used, bytes: 68383312 >> >> Compacted partition minimum bytes: 104 >> >> Compacted partition maximum bytes: 1629722 >> >> Compacted partition mean bytes: 1773 >> >> Average live cells per slice (last five minutes): 0.0 >> >> Average tombstones per slice (last five minutes): 0.0 >> >> >> @Tyler Hobbs >> >> we are using cassandra 2.0.15 so >> https://issues.apache.org/jira/browse/CASSANDRA-8525 shouldnt occur. >> Other problems looks like will be fixed in 3.0 .. we will mostly try and >> slot in an upgrade to 3.x version towards second quarter of this year. >> >> >> @Daemon >> >> Latencies seem to have higher ratios, attached is the graph. >> >> >> I am mostly trying to look at Bloom filters, because the way we do reads, >> we read data with non existent partition keys and it seems to be taking >> long to respond, like for 720 queries it takes 2 seconds, with all 721 >> queries not returning anything. the 720 queries are done in sequence of >> 180 queries each with 180 of them running in parallel. 
>> >> >> thanks >> >> anishek >> >> >> >> On Fri, Feb 19, 2016 at 3:09 AM, Jaydeep Chovatia < >> chovatia.jayd...@gmail.com> wrote: >> >>> How many partition keys exists for the table which shows this problem >>> (or provide nodetool cfstats for that table)? >>> >>> On Thu, Feb 18, 2016 at 11:38 AM, daemeon reiydelle <daeme...@gmail.com> >>> wrote: >>> >>>> The bloom filter buckets the values in a small number of buckets. I >>>> have been surprised by how many cases I see with large cardinality where a >>>> few values populate a given bloom leaf, resulting in high false positives, >>>> and a surprising impact on latencies! >>>> >>>> Are you seeing 2:1 ranges between mean and worse case latencies >>>> (allowing for gc times)? >>>> >>>> Daemeon Reiydelle >>>> On Feb 18, 2016 8:57 AM, "Tyler Hobbs" <ty...@datastax.com> wrote: >>>> >>>>> You can try slightly lowering the bloom_filter_fp_chance on your table. >>>>> >>>>> Otherwise, it's possible that you're repeatedly querying one or two >>>>> partitions that always trigger a bloom filter false
Multi DC setup for analytics
Hello, We are using Cassandra 2.0.17 and have two logical DCs with different keyspaces, but both having the same logical name DC1. We want to set up another Cassandra cluster for analytics which should get data from both of the above DCs. If we set up the new DC with name DC2 and follow the steps at https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html, will it work? I would think we would first have to change the names of the existing clusters to two different names and then go on to add another DC getting data from these? Also, as soon as we add the node the data starts moving... this will only be the real-time changes done to the cluster, right? We still have to do the rebuild to get the data for the tokens for the node in the new cluster? Thanks Anishek
repairs how do we schedule
Hello, We used to run repair on each node using https://github.com/BrianGallew/cassandra_range_repair.git. Most of the time repairs finished in under 12 hrs per node; we had 4 nodes then. Gradually the repair time kept increasing as traffic increased, and we also added more nodes meanwhile. We have 7 nodes now, and repair on one node takes almost 3 days for one CF, and we have 2 CFs in there. Can we schedule multiple repairs at the same time? We don't delete data explicitly; rows are removed via TTL from one CF (using DTCS) and there is no delete operation on the other CF (using LCS). thanks anishek
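For context on what the linked range-repair tool does: it splits the token range into small subranges and repairs them one at a time, which makes long repairs resumable and keeps each repair session small. A simplified sketch of the splitting step for the Murmur3 partitioner's token space (assumptions: a contiguous, non-wrapping range; real tools then invoke `nodetool repair` with start/end tokens per subrange):

```python
MIN_TOKEN = -2**63       # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1    # Murmur3Partitioner maximum token

def split_range(start, end, steps):
    # Split (start, end] into `steps` roughly equal, contiguous subranges.
    width = (end - start) // steps
    edges = [start + i * width for i in range(steps)] + [end]
    return list(zip(edges[:-1], edges[1:]))

subranges = split_range(MIN_TOKEN, MAX_TOKEN, 256)
assert subranges[0][0] == MIN_TOKEN and subranges[-1][1] == MAX_TOKEN
```

Running subranges for different, non-overlapping token ranges concurrently is the usual way to parallelize without repairing the same replicas' data twice at once.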
Re: Traffic inconsistent across nodes
We have two DCs, one with the above 8 nodes and the other with 3 nodes. On Tue, Apr 12, 2016 at 8:06 PM, Eric Stevens <migh...@gmail.com> wrote: > Maybe include nodetool status here? Are the four nodes serving reads in > one DC (local to your driver's config) while the others are in another? > > On Tue, Apr 12, 2016, 1:01 AM Anishek Agarwal <anis...@gmail.com> wrote: > >> hello, >> >> we have 8 nodes in one cluster and attached is the traffic patterns >> across the nodes. >> >> its very surprising that only 4 nodes show transmitting (purple) packets. >> >> our driver configuration on clients has the following load balancing >> configuration : >> >> new TokenAwarePolicy( >> new >> DCAwareRoundRobinPolicy(configuration.get(Constants.LOCAL_DATA_CENTRE_NAME, >> "WDC")), >> true) >> >> >> any idea what is that we are missing which is leading to this skewed data >> read patterns >> >> cassandra drivers as below: >> >> <dependency> >> <groupId>com.datastax.cassandra</groupId> >> <artifactId>cassandra-driver-core</artifactId> >> <version>2.1.6</version> >> </dependency> >> <dependency> >> <groupId>com.datastax.cassandra</groupId> >> <artifactId>cassandra-driver-mapping</artifactId> >> <version>2.1.6</version> >> </dependency> >> >> cassandra version is 2.0.17 >> >> Thanks in advance for the help. >> >> Anishek >> >>
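One way to sanity-check whether token layout alone explains the skew: with TokenAwarePolicy, each request is routed to a replica for the key's token, so uneven token ownership shows up directly as uneven traffic. A toy single-DC simulation with hypothetical tokens (not the actual ring; `nodetool status` ownership percentages give the real picture):

```python
from bisect import bisect_left

def primary_replica(token, ring):
    # ring: sorted list of (token, node) pairs. The primary replica owns
    # the first ring token >= the key's token, wrapping around the ring.
    tokens = [t for t, _ in ring]
    return ring[bisect_left(tokens, token) % len(ring)][1]

# Hypothetical 8-node ring where 4 nodes own almost all of the token space:
ring = [(t, f"node{i}") for i, t in enumerate(
    [0, 10, 20, 30, 1000, 2000, 3000, 4000])]
counts = {}
for token in range(0, 4000, 7):   # evenly spread key tokens
    node = primary_replica(token, ring)
    counts[node] = counts.get(node, 0) + 1
print(counts)  # node4..node7 take nearly all reads; node0..node3 almost none
```

If `nodetool status` shows roughly equal ownership, the skew is more likely the driver's DC awareness (only nodes in the configured local DC, "WDC" here, serve LOCAL reads) than token placement.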
Re: disk space used vs nodetool status
Thanks Carlos, We didn't do any actions that would create a snapshot, and I couldn't find the listsnapshots command in 2.0.17, but I found the respective snapshot directories, and they were created more than a couple of months ago, so it might be that I have forgotten. It's fine now, I have cleared them. anishek On Tue, Mar 22, 2016 at 3:20 PM, Carlos Alonso <i...@mrcalonso.com> wrote: > I'd say you have snapshots holding disk space. > > Check it with nodetool listsnapshots. A snapshot is automatically taken on > destructive actions (drop, truncate...) and is basically a hard link to the > involved SSTables, so it's not considered as data load from Cassandra but > it is effectively using disk space. > > Hope this helps. > > Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso> > > On 22 March 2016 at 07:57, Anishek Agarwal <anis...@gmail.com> wrote: > >> Hello, >> >> Using cassandra 2.0.17 on one of the 7 nodes i see that the "Load" >> column from nodetool status >> shows around 279.34 GB where as doing df -h on the two mounted disks the >> total is about 400GB any reason of why this difference could show up and >> how do i go about finding the cause for this ? >> >> Thanks In Advance. >> Anishek >> > >
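Since `nodetool listsnapshots` couldn't be found on 2.0.17, one way to account for the space pinned by snapshots is to walk the data directory yourself. A hedged sketch, assuming the standard <data_dir>/<keyspace>/<table>/snapshots/<name> layout:

```python
import os

def snapshot_bytes(data_dir):
    # Sum file sizes under any .../snapshots/... path. Snapshots are hard
    # links to live sstables, so this is the space they pin, not
    # necessarily extra space on top of the live data load.
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        if "snapshots" in root.split(os.sep):
            total += sum(os.path.getsize(os.path.join(root, f))
                         for f in files)
    return total

# e.g. print(snapshot_bytes("/var/lib/cassandra/data"))
```

This also explains the "Load" vs `df -h` gap from the question: nodetool's Load excludes snapshot hard links, while the filesystem counts them once the original sstables are compacted away.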
Re: Multi DC setup for analytics
Hey Clint, we have two separate rings which don't talk to each other but both having the same DC name "DCX". @Raja, We had already gone towards the path you suggested. thanks all anishek On Fri, Mar 18, 2016 at 8:01 AM, Reddy Raja <areddyr...@gmail.com> wrote: > Yes. Here are the steps. > You will have to change the DC Names first. > DC1 and DC2 would be independent clusters. > > Create a new DC, DC3 and include these two DC's on DC3. > > This should work well. > > > On Thu, Mar 17, 2016 at 11:03 PM, Clint Martin < > clintlmar...@coolfiretechnologies.com> wrote: > >> When you say you have two logical DC both with the same name are you >> saying that you have two clusters of servers both with the same DC name, >> nether of which currently talk to each other? IE they are two separate >> rings? >> >> Or do you mean that you have two keyspaces in one cluster? >> >> Or? >> >> Clint >> On Mar 14, 2016 2:11 AM, "Anishek Agarwal" <anis...@gmail.com> wrote: >> >>> Hello, >>> >>> We are using cassandra 2.0.17 and have two logical DC having different >>> Keyspaces but both having same logical name DC1. >>> >>> we want to setup another cassandra cluster for analytics which should >>> get data from both the above DC. >>> >>> if we setup the new DC with name DC2 and follow the steps >>> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html >>> will it work ? >>> >>> I would think we would have to first change the names of existing >>> clusters to have to different names and then go with adding another dc >>> getting data from these? >>> >>> Also as soon as we add the node the data starts moving... this will all >>> be only real time changes done to the cluster right ? we still have to do >>> the rebuild to get the data for tokens for node in new cluster ? >>> >>> Thanks >>> Anishek >>> >> > > > -- > "In this world, you either have an excuse or a story. I preferred to have > a story" >
Re: Lot of GC on two nodes out of 7
Hello, Bryan, most of the partition sizes are under 45 KB. I have tried with concurrent_compactors: 8 for one of the nodes, still no improvement. I have tried max_heap_size: 8G, no improvement. I will try the new heap size of 2G, though I am sure CMS runs will take longer then. Also, it doesn't look like I mentioned what type of GC was causing the problems: on both nodes it's ParNew GC that takes long for each run, and too many runs happen in succession. anishek On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng <br...@blockcypher.com> wrote: > Hi Anishek, > > In addition to the good advice others have given, do you notice any > abnormally large partitions? What does cfhistograms report for 99% > partition size? A few huge partitions will cause very disproportionate load > on your cluster, including high GC. > > --Bryan > > On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <amit.f.si...@ericsson.com> > wrote: > >> Hi Anishek, >> >> >> >> We too faced similar problem in 2.0.14 and after doing some research we >> config few parameters in Cassandra.yaml and was able to overcome GC pauses >> . Those are : >> >> >> >> · memtable_flush_writers : increased from 1 to 3 as from tpstats >> output we can see mutations dropped so it means writes are getting >> blocked, so increasing number will have those catered. >> >> · memtable_total_space_in_mb : Default (1/4 of heap size), can >> lowered because larger long lived objects will create pressure on HEAP, so >> its better to reduce some amount of size. >> >> · Concurrent_compactors : Alain righlty pointed out this i.e >> reduce it to 8. You need to try this. >> >> >> >> Also please check whether you have mutations drop in other nodes or not. >> >> >> >> Hope this helps in your cluster too. 
>> >> >> >> Regards >> >> Amit Singh >> >> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com] >> *Sent:* Wednesday, March 02, 2016 9:33 PM >> *To:* user@cassandra.apache.org >> *Subject:* Re: Lot of GC on two nodes out of 7 >> >> >> >> Can you post a gist of the output of jstat -gccause (60 seconds worth)? >> I think it's cool you're willing to experiment with alternative JVM >> settings but I've never seen anyone use max tenuring threshold of 50 either >> and I can't imagine it's helpful. Keep in mind if your objects are >> actually reaching that threshold it means they've been copied 50x (really >> really slow) and also you're going to end up spilling your eden objects >> directly into your old gen if your survivor is full. Considering the small >> amount of memory you're using for heap I'm really not surprised you're >> running into problems. >> >> >> >> I recommend G1GC + 12GB heap and just let it optimize itself for almost >> all cases with the latest JVM versions. >> >> >> >> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ <arodr...@gmail.com> >> wrote: >> >> It looks like you are doing a good work with this cluster and know a lot >> about JVM, that's good :-). >> >> >> >> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM >> >> >> >> That's good hardware too. >> >> >> >> With 64 GB of ram I would probably directly give a try to >> `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably. >> >> >> >> Also I would also probably try lowering `HEAP_NEWSIZE=2G.` and using >> `-XX:MaxTenuringThreshold=15`, still on the canary node to observe the >> effects. But that's just an idea of something I would try to see the >> impacts, I don't think it will solve your current issues or even make it >> worse for this node. >> >> >> >> Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would >> allow you to store the memtables off-heap. Those are 2 improvements >> reducing the heap pressure that you might be interested in. 
>> >> >> >> I have spent time reading about all other options before including them >> and a similar configuration on our other prod cluster is showing good GC >> graphs via gcviewer. >> >> >> >> So, let's look for an other reason. >> >> >> >> there are MUTATION and READ messages dropped in high number on nodes in >> question and on other 5 nodes it varies between 1-3. >> >> >> >> - Is Memory, CPU or disk a bottleneck? Is one of those running at the >> limits? >> >> >> >> concurren
Re: Lot of GC on two nodes out of 7
Hey Jeff, one of the nodes with high GC has 1400 SSTables; all other nodes have about 500-900 SSTables. The other node with high GC has 636 SSTables. The average row size for compacted partitions is about 1640 bytes on all nodes. We have replication factor 3, but the problem is only on two nodes. The only other thing that stands out in cfstats is that the read time and write time on the nodes with high GC is 5-7 times higher than on the other 5 nodes, but I think that's expected. thanks anishek On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote: > Compaction falling behind will likely cause additional work on reads (more > sstables to merge), but I’d be surprised if it manifested in super long GC. > When you say twice as many sstables, how many is that?. > > In cfstats, does anything stand out? Is max row size on those nodes larger > than on other nodes? > > What you don’t show in your JVM options is the new gen size – if you do > have unusually large partitions on those two nodes (especially likely if > you have rf=2 – if you have rf=3, then there’s probably a third node > misbehaving you haven’t found yet), then raising new gen size can help > handle the garbage created by reading large partitions without having to > tolerate the promotion. Estimates for the amount of garbage vary, but it > could be “gigabytes” of garbage on a very wide partition (see > https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in progress > to help mitigate that type of pain). > > - Jeff > > From: Anishek Agarwal > Reply-To: "user@cassandra.apache.org" > Date: Tuesday, March 1, 2016 at 11:12 PM > To: "user@cassandra.apache.org" > Subject: Lot of GC on two nodes out of 7 > > Hello, > > we have a cassandra cluster of 7 nodes, all of them have the same JVM GC > configurations, all our writes / reads use the TokenAware Policy wrapping > a DCAware policy. All nodes are part of same Datacenter. > > We are seeing that two nodes are having high GC collection times. 
Then > mostly seem to spend time in GC like about 300-600 ms. This also seems to > result in higher CPU utilisation on these machines. Other 5 nodes don't > have this problem. > > There is no additional repair activity going on the cluster, we are not > sure why this is happening. > we checked cfhistograms on the two CF we have in the cluster and number of > reads seems to be almost same. > > we also used cfstats to see the number of ssttables on each node and one > of the nodes with the above problem has twice the number of ssttables than > other nodes. This still doesnot explain why two nodes have high GC > Overheads. our GC config is as below: > > JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC" > > JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC" > > JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled" > > JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8" > > JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50" > > JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70" > > JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly" > > JVM_OPTS="$JVM_OPTS -XX:+UseTLAB" > > JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m" > > JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts" > > JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops" > > JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark" > > JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48" > > JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48" > > JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent" > > JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions" > > JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity" > > JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs" > > # earlier value 131072 = 32768 * 4 > > JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072" > > JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600" > > JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768" > > JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768" > > #new > > JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled" > > We are using cassandra 2.0.17. 
If anyone has any suggestion as to how what > else we can look for to understand why this is happening please do reply. > > > > Thanks > anishek > > >
Lot of GC on two nodes out of 7
Hello, we have a cassandra cluster of 7 nodes, all of them have the same JVM GC configurations, and all our writes / reads use the TokenAware policy wrapping a DCAware policy. All nodes are part of the same datacenter. We are seeing that two nodes have high GC collection times. They mostly seem to spend time in GC, about 300-600 ms per collection. This also seems to result in higher CPU utilisation on these machines. The other 5 nodes don't have this problem. There is no additional repair activity going on the cluster, and we are not sure why this is happening. We checked cfhistograms on the two CFs we have in the cluster and the number of reads seems to be almost the same. We also used cfstats to see the number of sstables on each node, and one of the nodes with the above problem has twice the number of sstables as the other nodes. This still does not explain why two nodes have high GC overheads. Our GC config is as below:

JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
# earlier value 131072 = 32768 * 4
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
#new
JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"

We are using cassandra 2.0.17. If anyone has any suggestions as to what else we can look for to understand why this is happening, please do reply. Thanks anishek
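To quantify the ParNew pauses discussed in this thread, totalling the stop-the-world times from the GC log is more reliable than eyeballing individual collections. A sketch assuming the usual HotSpot `[Times: user=... sys=..., real=X.XX secs]` line format (from -XX:+PrintGCDetails logging; adjust the regex if your log format differs):

```python
import re

PAUSE_RE = re.compile(r"real=(\d+\.\d+) secs")

def total_pause_seconds(log_lines):
    # Sum the wall-clock ('real') time of each GC event; for ParNew these
    # are stop-the-world pauses, i.e. time the node was frozen.
    return sum(float(m.group(1))
               for line in log_lines
               for m in PAUSE_RE.finditer(line))

sample = [
    "[GC [ParNew: ...] [Times: user=0.40 sys=0.01, real=0.31 secs]",
    "[GC [ParNew: ...] [Times: user=0.55 sys=0.02, real=0.42 secs]",
]
print(total_pause_seconds(sample))  # ~0.73
```

Comparing this total per minute between the two misbehaving nodes and a healthy one makes the "300-600 ms, too many runs in succession" observation concrete.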
Re: Lot of GC on two nodes out of 7
@Jeff I was just trying to follow some more of the advice given above; I personally still think a larger newgen heap size would be better. @Jonathan I will post the whole logs. I have restarted the nodes with additional changes; most probably tomorrow or the day after I will put out the GC logs. The problem still exists on two nodes: too much time spent in GC. Additionally, I tried to print the state of the cluster via my application to see what is happening, and I see that the node with high GC has a lot of "inflight queries" -- almost 1100, while on the other nodes it is 0. The cfhistograms for all nodes show approximately the same number of reads -- so I am thinking the above phenomenon is happening because the node is spending time in GC. Also, the load balancing policy on the client is new TokenAwarePolicy(new DCAwareRoundRobinPolicy()). If you have any other ideas please keep posting them. thanks anishek On Sat, Mar 5, 2016 at 12:54 AM, Jonathan Haddad <j...@jonhaddad.com> wrote: > Without looking at your GC logs (you never posted a gist), my assumption > would be you're doing a lot of copying between survivor generations, and > they're taking a long time. You're probably also copying a lot of data to > your old gen as a result of having full-ish survivor spaces to begin with. > > On Thu, Mar 3, 2016 at 10:26 PM Jeff Jirsa <jeff.ji...@crowdstrike.com> > wrote: > >> I’d personally would have gone the other way – if you’re seeing parnew, >> increasing new gen instead of decreasing it should help drop (faster) >> rather than promoting to sv/oldgen (slower) ? >> >> >> >> From: Anishek Agarwal >> Reply-To: "user@cassandra.apache.org" >> Date: Thursday, March 3, 2016 at 8:55 PM >> >> To: "user@cassandra.apache.org" >> Subject: Re: Lot of GC on two nodes out of 7 >> >> Hello, >> >> Bryan, most of the partition sizes are under 45 KB >> >> I have tried with concurrent_compactors : 8 for one of the nodes still no >> improvement, >> I have tried max_heap_Size : 8G, no improvement. 
>> >> I will try the newHeapsize of 2G though i am sure CMS will be a longer >> then. >> >> Also doesn't look like i mentioned what type of GC was causing the >> problems. On both the nodes its the ParNewGC thats taking long for each run >> and too many runs are happening in succession. >> >> anishek >> >> >> On Fri, Mar 4, 2016 at 5:36 AM, Bryan Cheng <br...@blockcypher.com> >> wrote: >> >>> Hi Anishek, >>> >>> In addition to the good advice others have given, do you notice any >>> abnormally large partitions? What does cfhistograms report for 99% >>> partition size? A few huge partitions will cause very disproportionate load >>> on your cluster, including high GC. >>> >>> --Bryan >>> >>> On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F <amit.f.si...@ericsson.com> >>> wrote: >>> >>>> Hi Anishek, >>>> >>>> >>>> >>>> We too faced similar problem in 2.0.14 and after doing some research we >>>> config few parameters in Cassandra.yaml and was able to overcome GC pauses >>>> . Those are : >>>> >>>> >>>> >>>> · memtable_flush_writers : increased from 1 to 3 as from >>>> tpstats output we can see mutations dropped so it means writes are getting >>>> blocked, so increasing number will have those catered. >>>> >>>> · memtable_total_space_in_mb : Default (1/4 of heap size), can >>>> lowered because larger long lived objects will create pressure on HEAP, so >>>> its better to reduce some amount of size. >>>> >>>> · Concurrent_compactors : Alain righlty pointed out this i.e >>>> reduce it to 8. You need to try this. >>>> >>>> >>>> >>>> Also please check whether you have mutations drop in other nodes or not. >>>> >>>> >>>> >>>> Hope this helps in your cluster too. 
>>>> >>>> >>>> >>>> Regards >>>> >>>> Amit Singh >>>> >>>> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com] >>>> *Sent:* Wednesday, March 02, 2016 9:33 PM >>>> *To:* user@cassandra.apache.org >>>> *Subject:* Re: Lot of GC on two nodes out of 7 >>>> >>>> >>>> >>>> Can you post a gist of the output of jstat -gccause (60 seconds >>>> worth)? I think it's cool you're will
Re: Lot of GC on two nodes out of 7
t; might want to keep this as it or even reduce it if you have less than 16 GB > of native memory. Go with 8 GB if you have a lot of memory. > `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so > far. I had luck with values between 4 <--> 16 in the past. I would give a > try with 15. > `-XX:CMSInitiatingOccupancyFraction=70`--> Why not using default - 75 ? > Using default and then tune from there to improve things is generally a > good idea. > > You also use a bunch of option I don't know about, if you are uncertain > about them, you could try a default conf without the options you added and > just the using the changes above from default > https://github.com/apache/cassandra/blob/cassandra-2.0/conf/cassandra-env.sh. > Or you might find more useful information on a nice reference about this > topic which is Al Tobey's blog post about tuning 2.1. Go to the 'Java > Virtual Machine' part: > https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html > > FWIW, I also saw improvement in the past by upgrading to 2.1, Java 8 and > G1GC. G1GC is supposed to be easier to configure too. > > the average row size for compacted partitions is about 1640 bytes on all >> nodes. We have replication factor 3 but the problem is only on two nodes. >> > > I think Jeff is trying to spot a wide row messing with your system, so > looking at the max row size on those nodes compared to other is more > relevant than average size for this check. > > the only other thing that stands out in cfstats is the read time and write >> time on the nodes with high GC is 5-7 times higher than other 5 nodes, but >> i think thats expected. > > > I would probably look at this the reverse way: I imagine that extra GC is > a consequence of something going wrong on those nodes as JVM / GC are > configured the same way cluster-wide. GC / JVM issues are often due to > Cassandra / system / hardware issues, inducing extra pressure on the JVM. 
I > would try to tune JVM / GC only once the system is healthy, as I have often seen > high GC being a consequence rather than the root cause of an issue. > > To explore this possibility: > > Does this command show some dropped or blocked tasks? This would add > pressure to the heap. > nodetool tpstats > > Do you have errors in the logs? Always good to know when facing an issue. > grep -i "ERROR" /var/log/cassandra/system.log > > How are compactions tuned (throughput + concurrent compactors)? This > tuning might explain compactions not keeping up or a high GC pressure. > > What are your disks / CPU? To help us give you good arbitrary values to > try. > > Is there some iowait? Could point to a bottleneck or bad hardware. > iostat -mx 5 100 > > ... > > Hope one of those will point you to an issue, but there are many more > things you could check. > > Let us know how it goes, > > C*heers, > --- > Alain Rodriguez - al...@thelastpickle.com > France > > The Last Pickle - Apache Cassandra Consulting > http://www.thelastpickle.com > > > > 2016-03-02 10:33 GMT+01:00 Anishek Agarwal <anis...@gmail.com>: > >> also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G. >> >> On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <anis...@gmail.com> >> wrote: >> >>> Hey Jeff, >>> >>> one of the nodes with high GC has 1400 sstables; all other nodes have >>> about 500-900 sstables. The other node with high GC has 636 sstables. >>> >>> the average row size for compacted partitions is about 1640 bytes on all >>> nodes. We have replication factor 3 but the problem is only on two nodes. >>> the only other thing that stands out in cfstats is the read time and >>> write time on the nodes with high GC is 5-7 times higher than other 5 >>> nodes, but i think that's expected. 
>>> >>> thanks >>> anishek >>> >>> >>> >>> >>> On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> >>> wrote: >>> >>>> Compaction falling behind will likely cause additional work on reads >>>> (more sstables to merge), but I’d be surprised if it manifested in super >>>> long GC. When you say twice as many sstables, how many is that?. >>>> >>>> In cfstats, does anything stand out? Is max row size on those nodes >>>> larger than on other nodes? >>>> >>>> What you don’t show in your JVM options is the new gen size – if you do >>>> have unusually large partitions on those two nodes (especially likely if >>>> you have rf=2 – if
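Alain's checklist above starts with `nodetool tpstats`. As a minimal illustration of what to look for in that output, here is a hedged Python sketch that flags thread pools with a nonzero all-time-blocked count; the sample text and the `find_blocked` helper are invented for this example, not taken from the cluster in this thread.

```python
# Hypothetical helper: scan `nodetool tpstats`-style output for pools that
# have ever blocked, which the reply above suggests as a sign of heap pressure.
# SAMPLE is made up for illustration; real tpstats output has the same columns.
SAMPLE = """\
Pool Name     Active  Pending  Completed  Blocked  All time blocked
MutationStage 0       0        131964300  0        12
ReadStage     2       187      42071      0        0
FlushWriter   0       0        1016       3        3"""

def find_blocked(tpstats_text):
    """Return pool names whose 'All time blocked' column is nonzero."""
    suspects = []
    for line in tpstats_text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        pool, all_time_blocked = parts[0], int(parts[-1])
        if all_time_blocked > 0:
            suspects.append(pool)
    return suspects

print(find_blocked(SAMPLE))  # ['MutationStage', 'FlushWriter']
```

A blocked FlushWriter in particular means memtables are backing up in memory, which feeds directly into GC pressure.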
Re: Lot of GC on two nodes out of 7
also MAX_HEAP_SIZE=6G and HEAP_NEWSIZE=4G. On Wed, Mar 2, 2016 at 1:40 PM, Anishek Agarwal <anis...@gmail.com> wrote: > Hey Jeff, > > one of the nodes with high GC has 1400 SST tables, all other nodes have > about 500-900 SST tables. the other node with high GC has 636 SST tables. > > the average row size for compacted partitions is about 1640 bytes on all > nodes. We have replication factor 3 but the problem is only on two nodes. > the only other thing that stands out in cfstats is the read time and write > time on the nodes with high GC is 5-7 times higher than other 5 nodes, but > i think thats expected. > > thanks > anishek > > > > > On Wed, Mar 2, 2016 at 1:09 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com> > wrote: > >> Compaction falling behind will likely cause additional work on reads >> (more sstables to merge), but I’d be surprised if it manifested in super >> long GC. When you say twice as many sstables, how many is that?. >> >> In cfstats, does anything stand out? Is max row size on those nodes >> larger than on other nodes? >> >> What you don’t show in your JVM options is the new gen size – if you do >> have unusually large partitions on those two nodes (especially likely if >> you have rf=2 – if you have rf=3, then there’s probably a third node >> misbehaving you haven’t found yet), then raising new gen size can help >> handle the garbage created by reading large partitions without having to >> tolerate the promotion. Estimates for the amount of garbage vary, but it >> could be “gigabytes” of garbage on a very wide partition (see >> https://issues.apache.org/jira/browse/CASSANDRA-9754 for work in >> progress to help mitigate that type of pain). 
>> >> - Jeff >> >> From: Anishek Agarwal >> Reply-To: "user@cassandra.apache.org" >> Date: Tuesday, March 1, 2016 at 11:12 PM >> To: "user@cassandra.apache.org" >> Subject: Lot of GC on two nodes out of 7 >> >> Hello, >> >> we have a cassandra cluster of 7 nodes; all of them have the same JVM GC >> configurations, and all our writes / reads use the TokenAware Policy wrapping >> a DCAware policy. All nodes are part of the same datacenter. >> >> We are seeing that two nodes are having high GC collection times. They >> mostly seem to spend time in GC, about 300-600 ms. This also seems to >> result in higher CPU utilisation on these machines. The other 5 nodes don't >> have this problem. >> >> There is no additional repair activity going on in the cluster, and we are not >> sure why this is happening. >> we checked cfhistograms on the two CF we have in the cluster and the number >> of reads seems to be almost the same. >> >> we also used cfstats to see the number of sstables on each node, and one >> of the nodes with the above problem has twice as many sstables as the >> other nodes. This still does not explain why two nodes have high GC >> overheads. 
our GC config is as below: >>
>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=50"
>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>> # earlier value 131072 = 32768 * 4
>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=131072"
>> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32768"
>> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32768"
>> #new
>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>> We are using cassandra 2.0.17. If anyone has any suggestion as to >> what else we can look for to understand why this is happening, please do >> reply. >> >> >> >> Thanks >> anishek >> >> >> >
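The settings above, together with the MAX_HEAP_SIZE=6G / HEAP_NEWSIZE=4G values mentioned in the thread, imply an unusual heap layout that may be relevant to Jeff's point about promotion during wide-partition reads. A back-of-the-envelope sketch (arithmetic only, not JVM measurements):

```python
# Heap layout implied by the thread's settings (illustrative arithmetic only).
heap_gb = 6.0        # MAX_HEAP_SIZE=6G
new_gen_gb = 4.0     # HEAP_NEWSIZE=4G
cms_fraction = 0.70  # -XX:CMSInitiatingOccupancyFraction=70

old_gen_gb = heap_gb - new_gen_gb           # tenured space left behind new gen
cms_trigger_gb = old_gen_gb * cms_fraction  # old-gen occupancy that starts a CMS cycle

print(f"old gen: {old_gen_gb} GB, CMS triggers at ~{cms_trigger_gb:.1f} GB used")
```

With only ~2 GB of tenured space behind a 4 GB new gen, even modest promotion (for example from reading a wide partition) can push the old gen past the 70% CMS threshold quickly, which would surface as the frequent 300-600 ms collections described above.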
Re: Multi DC setup for analytics
Hey Guys, We did the necessary changes and were trying to get this back on track, but hit another wall, we have two Clusters in Different DC ( DC1 and DC2) with cluster names ( CLUSTER_1, CLUSTER_2) we want to have a common analytics cluster in DC3 with cluster name (CLUSTER_3). -- looks like this can't be done, so we have to setup two different analytics clusters ? can't we just get data from CLUSTER_1/2 to the same cluster CLUSTER_3 ? thanks anishek On Mon, Mar 21, 2016 at 3:31 PM, Anishek Agarwal <anis...@gmail.com> wrote: > Hey Clint, > > we have two separate rings which don't talk to each other but both having > the same DC name "DCX". > > @Raja, > > We had already gone towards the path you suggested. > > thanks all > anishek > > On Fri, Mar 18, 2016 at 8:01 AM, Reddy Raja <areddyr...@gmail.com> wrote: > >> Yes. Here are the steps. >> You will have to change the DC Names first. >> DC1 and DC2 would be independent clusters. >> >> Create a new DC, DC3 and include these two DC's on DC3. >> >> This should work well. >> >> >> On Thu, Mar 17, 2016 at 11:03 PM, Clint Martin < >> clintlmar...@coolfiretechnologies.com> wrote: >> >>> When you say you have two logical DC both with the same name are you >>> saying that you have two clusters of servers both with the same DC name, >>> neither of which currently talks to the other? IE they are two separate >>> rings? >>> >>> Or do you mean that you have two keyspaces in one cluster? >>> >>> Or? >>> >>> Clint >>> On Mar 14, 2016 2:11 AM, "Anishek Agarwal" <anis...@gmail.com> wrote: >>> >>>> Hello, >>>> >>>> We are using cassandra 2.0.17 and have two logical DC having different >>>> Keyspaces but both having the same logical name DC1. >>>> >>>> we want to setup another cassandra cluster for analytics which should >>>> get data from both the above DC. >>>> >>>> if we setup the new DC with name DC2 and follow the steps >>>> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html >>>> will it work ? 
>>>> >>>> I would think we would have to first change the names of the existing >>>> clusters to have two different names and then go with adding another dc >>>> getting data from these? >>>> >>>> Also as soon as we add the node the data starts moving... this will all >>>> be only the real-time changes done to the cluster, right ? we still have to do >>>> the rebuild to get the data for the tokens for the node in the new cluster ? >>>> >>>> Thanks >>>> Anishek >>>> >>> >> >> -- >> "In this world, you either have an excuse or a story. I preferred to have >> a story" >> > >
Re: Acceptable repair time
we have about 380GB / RF = 3 ~ 1200 GB on disk. since we are on 2.0.17 there is no incremental repair :( On Tue, Mar 29, 2016 at 6:05 PM, Kai Wang <dep...@gmail.com> wrote: > IIRC when we switched to LCS and ran the first full repair with > 250GB/RF=3, it took at least 12 hours for the repair to finish, then > another 3+ days for all the compaction to catch up. I called it "the big > bang of LCS". > > Since then we've been running nightly incremental repair. > > For me as long as it's reliable (no streaming errors, better progress > reporting etc), I actually don't mind if it takes more than a few hours to > do a full repair. But I am not sure about 4 days... I guess it depends on > the size of the cluster and data... > > On Tue, Mar 29, 2016 at 6:04 AM, Anishek Agarwal <anis...@gmail.com> > wrote: > >> I would really like to know the answer for the above because on some nodes >> repair takes almost 4 days for us :(. >> >> On Tue, Mar 29, 2016 at 8:34 AM, Jack Krupansky <jack.krupan...@gmail.com >> > wrote: >> >>> Someone recently asked me for advice when their repair time was 2-3 >>> days. I thought that was outrageous, but not unheard of. Personally, to me, >>> 2-3 hours would be about the limit of what I could tolerate, and my >>> personal goal would be that a full repair of a node should take no longer >>> than an hour, maybe 90 minutes tops. But... achieving those more >>> abbreviated repair times would strongly suggest that the amount of data on >>> each node be kept down to a tiny fraction of a typical spinning disk drive, >>> or even a fraction of a larger SSD drive. >>> >>> So, my question here is what people consider acceptable full repair >>> times for nodes and what the resulting node data size is. >>> >>> What impact vnodes has on these numbers is a bonus question. >>> >>> Thanks! >>> >>> -- Jack Krupansky >>> >> >> >
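For a rough sense of scale, the numbers quoted above (~1200 GB of replicated data, ~4 days per repair on some nodes) imply an effective end-to-end repair rate. A hedged sketch, treating repair as if it processed all the data at a constant rate (a simplification, since Merkle-tree validation and streaming overlap in practice):

```python
# Effective repair rate implied by ~1200 GB taking ~4 days (illustrative only).
data_mb = 1200 * 1024    # ~1200 GB expressed in MB
seconds = 4 * 24 * 3600  # 4 days
rate_mb_s = data_mb / seconds

print(f"~{rate_mb_s:.1f} MB/s effective repair rate")
```

That is far below raw disk or network throughput, consistent with the common experience that validation compaction and coordination, not bulk I/O, dominate full-repair time on dense nodes.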
Re: Acceptable repair time
I would really like to know the answer for the above because on some nodes repair takes almost 4 days for us :(. On Tue, Mar 29, 2016 at 8:34 AM, Jack Krupansky wrote: > Someone recently asked me for advice when their repair time was 2-3 days. > I thought that was outrageous, but not unheard of. Personally, to me, 2-3 > hours would be about the limit of what I could tolerate, and my personal > goal would be that a full repair of a node should take no longer than an > hour, maybe 90 minutes tops. But... achieving those more abbreviated repair > times would strongly suggest that the amount of data on each node be kept > down to a tiny fraction of a typical spinning disk drive, or even a > fraction of a larger SSD drive. > > So, my question here is what people consider acceptable full repair times > for nodes and what the resulting node data size is. > > What impact vnodes has on these numbers is a bonus question. > > Thanks! > > -- Jack Krupansky >
Re: Multi DC setup for analytics
Hey Bryan, Thanks for the info, we inferred as much. currently the only other thing we were trying was to start two separate instances in the analytics cluster on the same set of machines to talk to the respective individual DC's, but within 2 mins dropped that, as we would have to change ports on at least one of the existing DC's so that when they join with the analytics cluster they are on the same port. for now we are just getting another set of machines for this. I had known about the pattern of using a separate analytics cluster for cassandra but thought we could join them across two clusters; my bad. now that i think of it, it would have been better to have just one DC for realtime prod requests instead of two. are there ways of merging existing clusters into one cluster in cassandra ? On Fri, Apr 1, 2016 at 5:05 AM, Bryan Cheng <br...@blockcypher.com> wrote: > I'm jumping into this thread late, so sorry if this has been covered > before. But am I correct in reading that you have two different Cassandra > rings, not talking to each other at all, and you want to have a shared DC > with a third Cassandra ring? > > I'm not sure what you want to do is possible. > > If I had the luxury of starting from scratch, the design I would do is: > All three DC's in one cluster, with 3 datacenters. DC3 is the analytics DC. > DC1's keyspaces are replicated to DC1 and DC3 only. > DC2's keyspaces are replicated to DC2 and DC3 only. > > Then you have DC3 with all data from both DC1 and DC2 to run analytics on, > and no cross-talk between DC1 and DC2. > > If you cannot rebuild your existing clusters, you may want to consider > using something like Spark to ETL your data out of DC1 and DC2 into a new > cluster at DC3. At that point you're running a data warehouse and lose some > of the advantages of seamless cluster membership. 
> > On Wed, Mar 30, 2016 at 5:43 AM, Anishek Agarwal <anis...@gmail.com> > wrote: > >> Hey Guys, >> >> We did the necessary changes and were trying to get this back on track, >> but hit another wall, >> >> we have two Clusters in Different DC ( DC1 and DC2) with cluster names ( >> CLUSTER_1, CLUSTER_2) >> >> we want to have a common analytics cluster in DC3 with cluster name >> (CLUSTER_3). -- looks like this can't be done, so we have to setup two >> different analytics clusters ? can't we just get data from CLUSTER_1/2 to >> the same cluster CLUSTER_3 ? >> >> thanks >> anishek >> >> On Mon, Mar 21, 2016 at 3:31 PM, Anishek Agarwal <anis...@gmail.com> >> wrote: >> >>> Hey Clint, >>> >>> we have two separate rings which don't talk to each other but both >>> having the same DC name "DCX". >>> >>> @Raja, >>> >>> We had already gone towards the path you suggested. >>> >>> thanks all >>> anishek >>> >>> On Fri, Mar 18, 2016 at 8:01 AM, Reddy Raja <areddyr...@gmail.com> >>> wrote: >>> >>>> Yes. Here are the steps. >>>> You will have to change the DC Names first. >>>> DC1 and DC2 would be independent clusters. >>>> >>>> Create a new DC, DC3 and include these two DC's on DC3. >>>> >>>> This should work well. >>>> >>>> >>>> On Thu, Mar 17, 2016 at 11:03 PM, Clint Martin < >>>> clintlmar...@coolfiretechnologies.com> wrote: >>>> >>>>> When you say you have two logical DC both with the same name are you >>>>> saying that you have two clusters of servers both with the same DC name, >>>>> neither of which currently talks to the other? IE they are two separate >>>>> rings? >>>>> >>>>> Or do you mean that you have two keyspaces in one cluster? >>>>> >>>>> Or? >>>>> >>>>> Clint >>>>> On Mar 14, 2016 2:11 AM, "Anishek Agarwal" <anis...@gmail.com> wrote: >>>>> >>>>>> Hello, >>>>>> >>>>>> We are using cassandra 2.0.17 and have two logical DC having >>>>>> different Keyspaces but both having the same logical name DC1. 
>>>>>> >>>>>> we want to setup another cassandra cluster for analytics which should >>>>>> get data from both the above DC. >>>>>> >>>>>> if we setup the new DC with name DC2 and follow the steps >>>>>> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html >>>>>> will it work ? >>>>>> >>>>>> I would think we would have to first change the names of the existing >>>>>> clusters to have two different names and then go with adding another dc >>>>>> getting data from these? >>>>>> >>>>>> Also as soon as we add the node the data starts moving... this will >>>>>> all be only the real-time changes done to the cluster, right ? we still have >>>>>> to do the rebuild to get the data for the tokens for the node in the new cluster ? >>>>>> >>>>>> Thanks >>>>>> Anishek >>>>>> >>>>> >>>> >>>> >>>> -- >>>> "In this world, you either have an excuse or a story. I preferred to >>>> have a story" >>>> >>> >>> >> >
Re: Traffic inconsistent across nodes
here is the output: every node in a single DC is in the same rack.

Datacenter: WDC5
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.125.138.33   299.22 GB  256     64.2%             8aaa6015-d444-4551-a3c5-3257536df476  RAC1
UN  10.125.138.125  329.38 GB  256     70.3%             70be44a2-de17-41f1-9d3a-6a0be600eedf  RAC1
UN  10.125.138.129  305.11 GB  256     65.5%             0fbc7f44-7062-4996-9eba-2a05ae1a7032  RAC1

Datacenter: WDC
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.124.114.105  151.09 GB  256     38.0%             c432357d-bf81-4eef-98e1-664c178a3c23  RAC1
UN  10.124.114.110  150.15 GB  256     36.9%             6f92d32e-1c64-4145-83d7-265c331ea408  RAC1
UN  10.124.114.108  170.1 GB   256     41.3%             040ae7e5-3f1e-4874-8738-45edbf576e12  RAC1
UN  10.124.114.98   165.34 GB  256     37.6%             cdc69c7d-b9d6-4abd-9388-1cdcd35d946c  RAC1
UN  10.124.114.113  145.22 GB  256     35.7%             1557af04-e658-4751-b984-8e0cdc41376e  RAC1
UN  10.125.138.59   162.65 GB  256     38.6%             9ba1b7b6-5655-456e-b1a1-6f429750fc96  RAC1
UN  10.124.114.97   164.03 GB  256     36.9%             c918e497-498e-44c3-ab01-ab5cb4d48b09  RAC1
UN  10.124.114.118  139.62 GB  256     35.1%             2bb0c265-a5d4-4cd4-8f50-13b5a9a891c9  RAC1

On Thu, Apr 14, 2016 at 4:48 AM, Eric Stevens <migh...@gmail.com> wrote: > The output of nodetool status would really help answer some questions. I > take it the 8 hosts in your graph are in the same DC. Are the four serving > writes in the same logical or physical rack (as Cassandra sees it), while > the others are not? > > On Tue, Apr 12, 2016 at 10:48 PM Anishek Agarwal <anis...@gmail.com> > wrote: > >> We have two DC one with the above 8 nodes and other with 3 nodes. >> >> >> >> On Tue, Apr 12, 2016 at 8:06 PM, Eric Stevens <migh...@gmail.com> wrote: >> >>> Maybe include nodetool status here? Are the four nodes serving reads in >>> one DC (local to your driver's config) while the others are in another? 
>>> >>> On Tue, Apr 12, 2016, 1:01 AM Anishek Agarwal <anis...@gmail.com> wrote: >>> >>>> hello, >>>> >>>> we have 8 nodes in one cluster and attached is the traffic pattern >>>> across the nodes. >>>> >>>> its very surprising that only 4 nodes show transmitting (purple) >>>> packets. >>>> >>>> our driver configuration on clients has the following load balancing >>>> configuration : >>>> >>>> new TokenAwarePolicy( >>>> new >>>> DCAwareRoundRobinPolicy(configuration.get(Constants.LOCAL_DATA_CENTRE_NAME, >>>> "WDC")), >>>> true) >>>> >>>> >>>> any idea what it is that we are missing that is leading to these skewed >>>> read patterns? >>>> >>>> cassandra drivers are as below: >>>> >>>> <dependency> >>>> <groupId>com.datastax.cassandra</groupId> >>>> <artifactId>cassandra-driver-core</artifactId> >>>> <version>2.1.6</version> >>>> </dependency> >>>> <dependency> >>>> <groupId>com.datastax.cassandra</groupId> >>>> <artifactId>cassandra-driver-mapping</artifactId> >>>> <version>2.1.6</version> >>>> </dependency> >>>> >>>> cassandra version is 2.0.17 >>>> >>>> Thanks in advance for the help. >>>> >>>> Anishek >>>> >>>> >>
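The load-balancing policy in the thread above sends each query preferentially to a replica of its partition key inside the local DC. A toy Python model of that routing (hosts, vnode counts, and the hash function are invented for illustration; the real driver uses the cluster's actual token metadata) shows how per-replica coordinator counts can be checked for skew:

```python
import bisect
import hashlib
from collections import Counter

# Toy model of TokenAwarePolicy(DCAwareRoundRobinPolicy): route each key to
# the replicas that own its token on a vnode ring. All names are invented.
HOSTS = ["10.0.0.%d" % i for i in range(1, 9)]  # 8 hypothetical local-DC hosts

def token(s):
    # Stand-in hash; Cassandra 2.0 would use Murmur3 over the partition key.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**16

# 4 vnodes per host for brevity (the cluster in this thread uses 256).
RING = sorted((token(f"{h}-{v}"), h) for h in HOSTS for v in range(4))
TOKENS = [t for t, _ in RING]

def replicas(key, rf=3):
    """First rf distinct hosts walking the ring clockwise from the key's token."""
    i = bisect.bisect_right(TOKENS, token(key))
    out = []
    for _, h in RING[i:] + RING[:i]:
        if h not in out:
            out.append(h)
            if len(out) == rf:
                break
    return out

# Token-aware routing sends each request to a replica; count coordinators.
load = Counter(replicas("key%d" % k)[0] for k in range(1000))
print(dict(load))  # uneven counts reflect uneven vnode ownership, not a driver bug
```

If a model like this (or the real `nodetool status` ownership column) shows roughly even ownership, then completely silent nodes point away from token distribution and toward connectivity or monitoring issues instead.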
Re: Traffic inconsistent across nodes
Looks like some problem with our monitoring framework. Thanks for your help! On Mon, Apr 18, 2016 at 2:46 PM, Anishek Agarwal <anis...@gmail.com> wrote: > OS used : CentOS 6 on all nodes except *10*.125.138.59 ( which runs CentOS > 7) > All of them are running Cassandra 2.0.17 > > output of the test : > > host ip: 10.124.114.113 > > host DC : WDC > > distance of host: LOCAL > > host is up: true > > cassandra version : 2.0.17 > > host ip: 10.124.114.108 > > host DC : WDC > > distance of host: LOCAL > > host is up: true > > cassandra version : 2.0.17 > > host ip: 10.124.114.110 > > host DC : WDC > > distance of host: LOCAL > > host is up: true > > cassandra version : 2.0.17 > > host ip: 10.124.114.118 > > host DC : WDC > > distance of host: LOCAL > > host is up: true > > cassandra version : 2.0.17 > > host ip: 10.125.138.59 > > host DC : WDC > > distance of host: LOCAL > > host is up: true > > cassandra version : 2.0.17 > > host ip: 10.124.114.97 > > host DC : WDC > > distance of host: LOCAL > > host is up: true > > cassandra version : 2.0.17 > > host ip: 10.124.114.105 > > host DC : WDC > > distance of host: LOCAL > > host is up: true > > cassandra version : 2.0.17 > > host ip: 10.124.114.98 > > host DC : WDC > > distance of host: LOCAL > > host is up: true > > cassandra version : 2.0.17 > > > On Fri, Apr 15, 2016 at 6:47 PM, Eric Stevens <migh...@gmail.com> wrote: > >> Thanks for that, that helps a lot. The next thing to check might be >> whether or not your application actually has access to the other nodes. >> With that topology, and assuming all the nodes you included in your >> original graph are in the 'WDC' data center, I'd be inclined to look for a >> network issue of some kind. >> >> Also, it probably doesn't matter, but what OS / Distribution are you >> running the servers and clients on? 
>> >> Check with netcat or something that you can reach all the configured >> ports from your application server, but also the driver itself offers some >> introspection into its view of individual connection health. This is a >> little bit ugly, but this is how we include information about connection >> status in an API for health monitoring from a Scala application using the >> Java driver; hopefully you can use it to see how to access information >> about the driver's view of host health from the application's perspective. >> Most importantly I'd suggest looking for host.isUp status and >> LoadBalancingPolicy.distance(host) to see that it considers all the hosts >> in your target datacenter to be LOCAL. >> >> "hosts" -> { >> val hosts: Map[String, Map[String, mutable.Set[Host]]] = >> connection.getMetadata >> .getAllHosts.asScala >> .groupBy(_.getDatacenter) >> .mapValues(_.groupBy(_.getRack)) >> val lbp: LoadBalancingPolicy = >> connection.getConfiguration.getPolicies.getLoadBalancingPolicy >> JsObject(hosts.map { case (dc: String, rackAndHosts) => >> dc -> JsObject(rackAndHosts.map { case (rack: String, hosts: >> mutable.Set[Host]) => >> rack -> JsArray(hosts.map { host => >> Json.obj( >> "address" -> host.getAddress.toString, >> "socketAddress"-> host.getSocketAddress.toString, >> "cassandraVersion" -> host.getCassandraVersion.toString, >> "isUp" -> host.isUp, >> "hostDistance" -> lbp.distance(host).toString >> ) >> }.toSeq) >> }.toSeq) >> }.toSeq) >> }, >> >> >> On Wed, Apr 13, 2016 at 10:50 PM Anishek Agarwal <anis...@gmail.com> >> wrote: >> >>> here is the output: every node in a single DC is in the same rack. 
>>> >>> Datacenter: WDC5 >>> >>> >>> >>> Status=Up/Down >>> >>> |/ State=Normal/Leaving/Joining/Moving >>> >>> -- Address Load Tokens Owns (effective) Host ID >>> Rack >>> >>> UN 10.125.138.33 299.22 GB 256 64.2% >>> 8aaa6015-d444-4551-a3c5-3257536df476 RAC1 >>> >>> UN 10.125.138.125 329.38 GB 256 70.3% >>> 70be44a2-de17-41f1-9d3a-6a0be600eedf RAC1 >>> >>> UN 10.125.138.129 305.11 GB 256 65.5% >>> 0fbc7f44-7062-4996-9
Re: Traffic inconsistent across nodes
OS used : Cent OS 6 on all nodes except *10*.125.138.59 ( which runs Cent OS 7) All of them are running Cassandra 2.0.17 output of the test : host ip: 10.124.114.113 host DC : WDC distance of host: LOCAL host is up: true cassandra version : 2.0.17 host ip: 10.124.114.108 host DC : WDC distance of host: LOCAL host is up: true cassandra version : 2.0.17 host ip: 10.124.114.110 host DC : WDC distance of host: LOCAL host is up: true cassandra version : 2.0.17 host ip: 10.124.114.118 host DC : WDC distance of host: LOCAL host is up: true cassandra version : 2.0.17 host ip: 10.125.138.59 host DC : WDC distance of host: LOCAL host is up: true cassandra version : 2.0.17 host ip: 10.124.114.97 host DC : WDC distance of host: LOCAL host is up: true cassandra version : 2.0.17 host ip: 10.124.114.105 host DC : WDC distance of host: LOCAL host is up: true cassandra version : 2.0.17 host ip: 10.124.114.98 host DC : WDC distance of host: LOCAL host is up: true cassandra version : 2.0.17 On Fri, Apr 15, 2016 at 6:47 PM, Eric Stevens <migh...@gmail.com> wrote: > Thanks for that, that helps a lot. The next thing to check might be > whether or not your application actually has access to the other nodes. > With that topology, and assuming all the nodes you included in your > original graph are in the 'WDC' data center, I'd be inclined to look for a > network issue of some kind. > > Also, it probably doesn't matter, but what OS / Distribution are you > running the servers and clients on? > > Check with netcat or something that you can reach all the configured ports > from your application server, but also the driver itself offers some > introspection into its view of individual connection health. 
This is a > little bit ugly, but this is how we include information about connection > status in an API for health monitoring from a Scala application using the > Java driver; hopefully you can use it to see how to access information > about the driver's view of host health from the application's perspective. > Most importantly I'd suggest looking for host.isUp status and > LoadBalancingPolicy.distance(host) to see that it considers all the hosts > in your target datacenter to be LOCAL. > > "hosts" -> { > val hosts: Map[String, Map[String, mutable.Set[Host]]] = > connection.getMetadata > .getAllHosts.asScala > .groupBy(_.getDatacenter) > .mapValues(_.groupBy(_.getRack)) > val lbp: LoadBalancingPolicy = > connection.getConfiguration.getPolicies.getLoadBalancingPolicy > JsObject(hosts.map { case (dc: String, rackAndHosts) => > dc -> JsObject(rackAndHosts.map { case (rack: String, hosts: > mutable.Set[Host]) => > rack -> JsArray(hosts.map { host => > Json.obj( > "address" -> host.getAddress.toString, > "socketAddress"-> host.getSocketAddress.toString, > "cassandraVersion" -> host.getCassandraVersion.toString, > "isUp" -> host.isUp, > "hostDistance" -> lbp.distance(host).toString > ) > }.toSeq) > }.toSeq) > }.toSeq) > }, > > > On Wed, Apr 13, 2016 at 10:50 PM Anishek Agarwal <anis...@gmail.com> > wrote: > >> here is the output: every node in a single DC is in the same rack. 
>> >> Datacenter: WDC5 >> >> >> >> Status=Up/Down >> >> |/ State=Normal/Leaving/Joining/Moving >> >> -- Address Load Tokens Owns (effective) Host ID >> Rack >> >> UN 10.125.138.33 299.22 GB 256 64.2% >> 8aaa6015-d444-4551-a3c5-3257536df476 RAC1 >> >> UN 10.125.138.125 329.38 GB 256 70.3% >> 70be44a2-de17-41f1-9d3a-6a0be600eedf RAC1 >> >> UN 10.125.138.129 305.11 GB 256 65.5% >> 0fbc7f44-7062-4996-9eba-2a05ae1a7032 RAC1 >> >> Datacenter: WDC >> >> === >> >> Status=Up/Down >> >> |/ State=Normal/Leaving/Joining/Moving >> >> -- Address Load Tokens Owns (effective) Host ID >> Rack >> >> UN 10.124.114.105 151.09 GB 256 38.0% >> c432357d-bf81-4eef-98e1-664c178a3c23 RAC1 >> >> UN 10.124.114.110 150.15 GB 256 36.9% >> 6f92d32e-1c64-4145-83d7-265c331ea408 RAC1 >> >> UN 10.124.114.108 170.1 GB 256 41.3% >> 040ae7e5-3f1e-4874-8738-45edbf576e12 RAC1 >> >> UN 10.124.114.98 165.34 GB 256 37.6% >> cdc69c7d-b9d6-4abd-9388-1cdcd35d946c RAC1 >> >> UN 10.124.114.113 145.22 GB 256 35.7% >> 1557af04-e658-4751
Re: nodetool repair with -pr and -dc
ok thanks, so if we want to use the -pr option (which i suppose we should, to prevent duplicate checks) in 2.0, then if we run the repair on all nodes in a single DC, will that be sufficient so that we do not need to run it on all nodes across DC's ? On Wed, Aug 10, 2016 at 5:01 PM, Paulo Motta <pauloricard...@gmail.com> wrote: > On 2.0 the repair -pr option is not supported together with -local, -hosts or > -dc, since it assumes you need to repair all nodes in all DCs, and it will > throw an error if you try to run it with nodetool, so perhaps there's > something wrong with the range_repair options parsing. > > On 2.1 support was added for simultaneous -pr and -local options in > CASSANDRA-7450, so if you need that you can either upgrade to 2.1 or > backport that to 2.0. > > > 2016-08-10 5:20 GMT-03:00 Anishek Agarwal <anis...@gmail.com>: > >> Hello, >> >> We have a 2.0.17 cassandra cluster (*DC1*) with a cross-dc setup with a >> smaller cluster (*DC2*). After reading various blogs about >> scheduling/running repairs, it looks like it's good to run it with the following: >> >> -pr for primary range only >> -st -et for sub ranges >> -par for parallel >> -dc to make sure we can schedule repairs independently on each data >> centre we have. >> >> i have configured the above using the repair utility @ >> https://github.com/BrianGallew/cassandra_range_repair.git >> >> which leads to the following command : >> >> ./src/range_repair.py -k [keyspace] -c [columnfamily name] -v -H >> localhost -p -D *DC1* >> >> but it looks like the merkle tree is being calculated on nodes which are >> part of the other *DC2*. >> >> why does this happen? i thought it should only look at the nodes in the local >> cluster. however on nodetool the *-pr* option cannot be used with >> *-local* according to the docs @ https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRepair.html >> >> so i may be missing something, can someone help explain this please. >> >> thanks >> anishek >> > >
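Paulo's explanation can be made concrete with a toy ring. With -pr each node repairs only the range it is primary for over the whole cluster, so if only DC1 nodes run it, ranges whose primary replica sits in DC2 are never repaired. A sketch with invented tokens and DC assignments:

```python
# Toy 4-node, 2-DC ring: each node is primary for the range ending at its token.
# Tokens, node names, and DC assignments are invented for illustration.
RING = [  # (token, node, dc), sorted by token
    (0,  "n1", "DC1"),
    (25, "n2", "DC2"),
    (50, "n3", "DC1"),
    (75, "n4", "DC2"),
]

def ranges_covered(nodes_running_pr):
    """Primary ranges repaired when only the given nodes run `repair -pr`."""
    covered = []
    for i, (tok, node, _dc) in enumerate(RING):
        prev_tok = RING[i - 1][0]  # range (prev_tok, tok], wrapping at i == 0
        if node in nodes_running_pr:
            covered.append((prev_tok, tok))
    return covered

all_nodes = {n for _, n, _ in RING}
dc1_nodes = {n for _, n, d in RING if d == "DC1"}

print(ranges_covered(all_nodes))  # all 4 ranges: the full ring is repaired once
print(ranges_covered(dc1_nodes))  # only 2 ranges: DC2-primary ranges are skipped
```

This is why 2.0's nodetool refuses -pr together with -local / -dc: restricting -pr to one DC silently leaves holes, so every node cluster-wide must run it (CASSANDRA-7450, mentioned above, relaxed this in 2.1).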
nodetool repair with -pr and -dc
Hello, We have a 2.0.17 cassandra cluster (*DC1*) with a cross-dc setup with a smaller cluster (*DC2*). After reading various blogs about scheduling/running repairs, it looks like it's good to run it with the following: -pr for primary range only -st -et for sub ranges -par for parallel -dc to make sure we can schedule repairs independently on each data centre we have. i have configured the above using the repair utility @ https://github.com/BrianGallew/cassandra_range_repair.git which leads to the following command : ./src/range_repair.py -k [keyspace] -c [columnfamily name] -v -H localhost -p -D *DC1* but it looks like the merkle tree is being calculated on nodes which are part of the other *DC2*. why does this happen? i thought it should only look at the nodes in the local cluster. however on nodetool the *-pr* option cannot be used with *-local* according to the docs @ https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRepair.html so i may be missing something, can someone help explain this please. thanks anishek