Re: Multi-DC Environment Question
Hello again,

Back to this after a while... As far as I can tell, whenever DC2 is unavailable there is one node in DC1 that acts as a coordinator. When DC2 is available again, this one node sends the hints to only one node at DC2, which then sends any replicas to the other nodes in its local DC (DC2). This ensures efficient cross-DC bandwidth usage. I was watching system.hints on all nodes during this test and this is the conclusion I came to. Two things:

1. If the above is correct, does the same apply when performing anti-entropy repair (without specifying a particular DC)? I'm just hoping the answer to this is going to be YES, otherwise the VPN is not going to be very happy in our case, and we would prefer not to saturate it whenever running nodetool repair. I suppose we could have a traffic limiter on the firewalls as a worst-case scenario, but I would appreciate your input if you know more about this.

2. As I described earlier, in order to test this I was watching the system.hints CF to monitor any hints. I was looking to add a Nagios check for this purpose, so I was looking into the JMX Console. I noticed that when a node stores hints, MBean org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=hints, attribute MemtableColumnsCount, goes up (although I would expect it to be MemtableRowCount or something?). This attribute retains its value until the other node becomes available and ready to receive the hints. I was looking for another attribute somewhere to monitor the active hints. I checked:

MBean org.apache.cassandra.metrics:type=ColumnFamily,keyspace=system,scope=hints,name=PendingTasks
MBean org.apache.cassandra.metrics:type=Storage,name=TotalHints
MBean org.apache.cassandra.metrics:type=Storage,name=TotalHintsInProgress
MBean org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=ActiveTasks

and even MBean org.apache.cassandra.metrics:type=HintedHandOffManager,name=Hints_not_stored-/10.2.1.100 (this one will never go back to zero). None of them would increase whenever any hints were being sent (or at least I didn't catch it because it was too fast or whatever?). Does anyone know what all these attributes represent? It looks like there are more specific hint attributes on a per-CF basis, but I was looking for a more generic one to begin with. Any help would be much appreciated.

Thanks in advance,
Vasilis

On Wed, Jun 4, 2014 at 1:42 PM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

Hello Matt,

nodetool status:

Datacenter: MAN
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns (effective)  Host ID                               Token                 Rack
UN  10.2.1.103  89.34 KB   99.2%             b7f8bc93-bf39-475c-a251-8fbe2c7f7239  -9211685935328163899  RAC1
UN  10.2.1.102  86.32 KB   0.7%              1f8937e1-9ecb-4e59-896e-6d6ac42dc16d  -3511707179720619260  RAC1
Datacenter: DER
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns (effective)  Host ID                               Token                 Rack
UN  10.2.1.101  75.43 KB   0.2%              e71c7ee7-d852-4819-81c0-e993ca87dd5c  -1277931707251349874  RAC1
UN  10.2.1.100  104.53 KB  99.8%             7333b664-ce2d-40cf-986f-d4b4d4023726  -9204412570946850701  RAC1

I do not know why the cluster is not balanced at the moment, but it holds almost no data. I will populate it soon and see how that goes. The output of 'nodetool ring' just lists all the tokens assigned to each individual node, and as you can imagine it would be pointless to paste it here. I just did 'nodetool ring | awk ... | uniq | wc -l' and it works out to be 1024 as expected (4 nodes x 256 tokens each). Still haven't got the answers to the other questions though...

Thanks,
Vasilis

On Wed, Jun 4, 2014 at 12:28 AM, Matthew Allen matthew.j.al...@gmail.com wrote:

Thanks Vasileios. I think I need to make a call as to whether to switch to vnodes or stick with tokens for my Multi-DC cluster. Would you be able to show a nodetool ring/status from your cluster to see what the token assignment looks like?

Thanks Matt

On Wed, Jun 4, 2014 at 8:31 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

I should have said that earlier really... I am using 1.2.16 and vnodes are enabled.

Thanks, Vasilis

--
Kind Regards,
Vasileios Vlachos
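(For anyone wanting to script the check described above: a minimal sketch of a Nagios-style plugin that polls MemtableColumnsCount on the hints CF over JMX using jmxterm, a third-party JMX CLI. The jar path, JMX port and thresholds are placeholders, and jmxterm flags may vary by version -- verify against your install:)

#!/bin/sh
# Sketch of a Nagios-style check for stored hints via jmxterm.
JMXTERM=/usr/local/lib/jmxterm-uber.jar    # placeholder path
BEAN='org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=hints'
WARN=1        # any stored hint columns -> WARNING (placeholder threshold)
CRIT=10000    # placeholder threshold

# 'get -s' prints the bare value; '-n' runs jmxterm non-interactively.
COUNT=$(echo "get -s -b $BEAN MemtableColumnsCount" \
        | java -jar "$JMXTERM" -l localhost:7199 -n 2>/dev/null)

[ -z "$COUNT" ] && { echo "UNKNOWN: could not read MBean"; exit 3; }
if [ "$COUNT" -ge "$CRIT" ]; then
    echo "CRITICAL: $COUNT hint columns stored"; exit 2
elif [ "$COUNT" -ge "$WARN" ]; then
    echo "WARNING: $COUNT hint columns stored"; exit 1
else
    echo "OK: $COUNT hint columns stored"; exit 0
fi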
Re: Multi-DC Environment Question
Hello Matt,

nodetool status:

Datacenter: MAN
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns (effective)  Host ID                               Token                 Rack
UN  10.2.1.103  89.34 KB   99.2%             b7f8bc93-bf39-475c-a251-8fbe2c7f7239  -9211685935328163899  RAC1
UN  10.2.1.102  86.32 KB   0.7%              1f8937e1-9ecb-4e59-896e-6d6ac42dc16d  -3511707179720619260  RAC1
Datacenter: DER
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Owns (effective)  Host ID                               Token                 Rack
UN  10.2.1.101  75.43 KB   0.2%              e71c7ee7-d852-4819-81c0-e993ca87dd5c  -1277931707251349874  RAC1
UN  10.2.1.100  104.53 KB  99.8%             7333b664-ce2d-40cf-986f-d4b4d4023726  -9204412570946850701  RAC1

I do not know why the cluster is not balanced at the moment, but it holds almost no data. I will populate it soon and see how that goes. The output of 'nodetool ring' just lists all the tokens assigned to each individual node, and as you can imagine it would be pointless to paste it here. I just did 'nodetool ring | awk ... | uniq | wc -l' and it works out to be 1024 as expected (4 nodes x 256 tokens each). Still haven't got the answers to the other questions though...

Thanks,
Vasilis

On Wed, Jun 4, 2014 at 12:28 AM, Matthew Allen matthew.j.al...@gmail.com wrote:

Thanks Vasileios. I think I need to make a call as to whether to switch to vnodes or stick with tokens for my Multi-DC cluster. Would you be able to show a nodetool ring/status from your cluster to see what the token assignment looks like?

Thanks Matt

On Wed, Jun 4, 2014 at 8:31 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

I should have said that earlier really... I am using 1.2.16 and vnodes are enabled.

Thanks, Vasilis

--
Kind Regards,
Vasileios Vlachos
Re: Multi-DC Environment Question
On Fri, May 30, 2014 at 4:08 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

Basically you sort of confirmed that if down_time > max_hint_window_in_ms the only way to bring DC1 up-to-date is anti-entropy repair. Also, read repair does not help either, as we assumed that down_time > max_hint_window_in_ms. Please correct me if I am wrong.

My understanding is that if you:

1) set read repair chance to 100%
2) read all keys in the keyspace with a client

you would accomplish the same increase in consistency as you would by running repair. In cases where this may matter, and your system can handle delivering the hints, increasing the current default of 3 hours (itself already increased from the old default of 1 hour) to 6 or more hours gives operators more time to work in the case of partition or failure. Note that hints are only an optimization; only repair (and read repair at 100%, I think..) asserts any guarantee of consistency.

=Rob
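(A sketch of what step 1 above could look like from cqlsh, assuming a hypothetical keyspace/CF named ks.cf1; cqlsh -f is used rather than -e for compatibility with 1.2-era cqlsh:)

# Step 1: set the global read_repair_chance to 100% on each CF.
# Note it is the global read_repair_chance, not dclocal_read_repair_chance,
# that can trigger repairs across DCs.
cat > /tmp/rr100.cql <<'EOF'
ALTER TABLE ks.cf1 WITH read_repair_chance = 1.0;
EOF
cqlsh -f /tmp/rr100.cql

# Step 2 (read all keys) is client-specific: iterate every row of ks.cf1
# from your application so each read has a chance to trigger read repair.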
Re: Multi-DC Environment Question
Thanks for your responses!

Matt, I did a test with 4 nodes, 2 in each DC, and the answer appears to be yes. The tokens seem to be unique across the entire cluster, not just on a per-DC basis. I don't know if the number of nodes deployed is enough to reassure me, but this is my conclusion for now. Please correct me if you know I'm wrong.

Rob, this is the plan of attack I have in mind now. Although, in the case of a catastrophic failure of a DC, the downtime is usually longer than that. So it's either less than the default value (when testing that the DR works, for example) or more (when actually using the DR as the primary DC). Based on that, the default seems reasonable to me. I also found that nodetool repair can be performed on one DC only by specifying the --in-local-dc option. So, presumably the classic nodetool repair applies to the entire cluster (sounds obvious, but is that actually correct?).

Question 3 in my previous email still remains unanswered for me... I cannot find out whether only one hint is stored on the coordinator irrespective of the number of replicas being down, and also whether the hint is 100% of the size of the original write request.

Thanks, Vasilis

On 03/06/14 18:52, Robert Coli wrote:

On Fri, May 30, 2014 at 4:08 AM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

Basically you sort of confirmed that if down_time > max_hint_window_in_ms the only way to bring DC1 up-to-date is anti-entropy repair. Also, read repair does not help either, as we assumed that down_time > max_hint_window_in_ms. Please correct me if I am wrong.

My understanding is that if you:

1) set read repair chance to 100%
2) read all keys in the keyspace with a client

you would accomplish the same increase in consistency as you would by running repair. In cases where this may matter, and your system can handle delivering the hints, increasing the current default of 3 hours (itself already increased from the old default of 1 hour) to 6 or more hours gives operators more time to work in the case of partition or failure. Note that hints are only an optimization; only repair (and read repair at 100%, I think..) asserts any guarantee of consistency.

=Rob

--
Kind Regards,
Vasileios Vlachos
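(For reference, the repair invocations being compared above; --in-local-dc is the option mentioned in this message, and -pr is shown on the assumption it behaves in 1.2.16 as documented, repairing primary ranges only:)

# Repair every range this node is a replica for, across the whole cluster:
nodetool repair

# Same, restricted to the node's primary ranges (the usual per-node form
# when repairing every node in turn, to avoid repairing each range RF times):
nodetool repair -pr

# Repair against replicas in the local datacenter only:
nodetool repair --in-local-dc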
Re: Multi-DC Environment Question
I should have said that earlier really... I am using 1.2.16 and vnodes are enabled.

Thanks, Vasilis

--
Kind Regards,
Vasileios Vlachos
Re: Multi-DC Environment Question
Hi Vasilis,

With regards to Question 2 ("How are tokens assigned when adding a 2nd DC? Is the range -2^63 to 2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? I think the latter is correct"):

Have you been able to deduce an answer to this (assuming the Murmur3 partitioner)? I'm facing the same problem and looking to do the latter (-2^63 to 2^63-1 for the entire cluster), given that nodetool repair isn't multi-DC aware (https://issues.apache.org/jira/browse/CASSANDRA-2609). Unfortunately the documentation (http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html) isn't quite clear ("Multiple data center deployments: calculate the tokens for each data center so that the hash range is evenly divided for the nodes in each data center.").

Thanks Matt

On Fri, May 30, 2014 at 9:08 PM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

Thanks for your responses, Ben thanks for the link.

Basically you sort of confirmed that if down_time > max_hint_window_in_ms the only way to bring DC1 up-to-date is anti-entropy repair. Read consistency level is irrelevant to the problem I described, as I am reading at LOCAL_QUORUM. In this situation I lost whatever data -if any- had not been transferred across to DC2 before DC1 went down; that is understandable. Also, read repair does not help either, as we assumed that down_time > max_hint_window_in_ms. Please correct me if I am wrong.

I think I could better understand how this works if I knew the answers to the following questions:

1. What is the output of nodetool status when a cluster spans 2 DCs? Will I be able to see ALL nodes irrespective of the DC they belong to?
2. How are tokens assigned when adding a 2nd DC? Is the range -2^63 to 2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? (I think the latter is correct)
3. Does the coordinator store 1 hint irrespective of how many replicas happen to be down at the time, and also irrespective of DC2 being down in the scenario I described above? (I think the answer is yes according to the presentation you sent me, but I would like someone to confirm that)

Thank you in advance,
Vasilis

On Fri, May 30, 2014 at 3:13 AM, Ben Bromhead b...@instaclustr.com wrote:

Short answer: If time elapsed > max_hint_window_in_ms then hints will stop being created. You will need to rely on your read consistency level, read repair and anti-entropy repair operations to restore consistency.

Long answer: http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359

On 30 May 2014, at 8:40 am, Tupshin Harper tups...@tupshin.com wrote:

When one node or DC is down, coordinator nodes being written through will notice this fact and store hints (hinted handoff is the mechanism), and those hints are used to send the data that was not able to be replicated initially. http://www.datastax.com/dev/blog/modern-hinted-handoff

-Tupshin

On May 29, 2014 6:22 PM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

Hello All,

We have plans to add a second DC to our live Cassandra environment. Currently RF=3 and we read and write at QUORUM. After adding DC2 we are going to be reading and writing at LOCAL_QUORUM. If my understanding is correct, when a client sends a write request and the consistency level is satisfied on DC1 (that is, RF/2+1 replicas), success is returned to the client and DC2 will eventually get the data as well. The assumption behind this is that the client always connects to DC1 for reads and writes, and that there is a site-to-site VPN between DC1 and DC2. Therefore, DC1 will almost always return success before DC2 (actually I don't know if it is possible for DC2 to be more up-to-date than DC1 with this setup...).

Now imagine DC1 loses connectivity and the client fails over to DC2. Everything should work fine after that, with the only difference being that DC2 will now be handling the requests directly from the client. After some time, say after max_hint_window_in_ms, DC1 comes back up. My question is: how do I bring DC1 up to speed with DC2, which is now more up-to-date? Will that require a nodetool repair on DC1 nodes? Also, what is the answer when the outage is < max_hint_window_in_ms instead?

Thanks in advance!
Vasilis

--
Kind Regards,
Vasileios Vlachos
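(One way to check the cluster-wide-range interpretation empirically on a vnode cluster, along the lines of the pipeline Vasilis used elsewhere in this thread; the awk field position is an assumption about the 1.2 'nodetool ring' layout, so verify which column holds the token first:)

# Count distinct tokens across the whole cluster. With 4 nodes at
# num_tokens=256, a result of 1024 means tokens are unique cluster-wide,
# i.e. one -2^63..2^63-1 range shared by both DCs. Assumes the token is
# the last field of each node line (lines starting with an IP address).
nodetool ring | awk '/^[0-9]/ {print $NF}' | sort -u | wc -l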
Re: Multi-DC Environment Question
Thanks for your responses, Ben thanks for the link.

Basically you sort of confirmed that if down_time > max_hint_window_in_ms the only way to bring DC1 up-to-date is anti-entropy repair. Read consistency level is irrelevant to the problem I described, as I am reading at LOCAL_QUORUM. In this situation I lost whatever data -if any- had not been transferred across to DC2 before DC1 went down; that is understandable. Also, read repair does not help either, as we assumed that down_time > max_hint_window_in_ms. Please correct me if I am wrong.

I think I could better understand how this works if I knew the answers to the following questions:

1. What is the output of nodetool status when a cluster spans 2 DCs? Will I be able to see ALL nodes irrespective of the DC they belong to?
2. How are tokens assigned when adding a 2nd DC? Is the range -2^63 to 2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? (I think the latter is correct)
3. Does the coordinator store 1 hint irrespective of how many replicas happen to be down at the time, and also irrespective of DC2 being down in the scenario I described above? (I think the answer is yes according to the presentation you sent me, but I would like someone to confirm that)

Thank you in advance,
Vasilis

On Fri, May 30, 2014 at 3:13 AM, Ben Bromhead b...@instaclustr.com wrote:

Short answer: If time elapsed > max_hint_window_in_ms then hints will stop being created. You will need to rely on your read consistency level, read repair and anti-entropy repair operations to restore consistency.

Long answer: http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359

On 30 May 2014, at 8:40 am, Tupshin Harper tups...@tupshin.com wrote:

When one node or DC is down, coordinator nodes being written through will notice this fact and store hints (hinted handoff is the mechanism), and those hints are used to send the data that was not able to be replicated initially. http://www.datastax.com/dev/blog/modern-hinted-handoff

-Tupshin

On May 29, 2014 6:22 PM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

Hello All,

We have plans to add a second DC to our live Cassandra environment. Currently RF=3 and we read and write at QUORUM. After adding DC2 we are going to be reading and writing at LOCAL_QUORUM. If my understanding is correct, when a client sends a write request and the consistency level is satisfied on DC1 (that is, RF/2+1 replicas), success is returned to the client and DC2 will eventually get the data as well. The assumption behind this is that the client always connects to DC1 for reads and writes, and that there is a site-to-site VPN between DC1 and DC2. Therefore, DC1 will almost always return success before DC2 (actually I don't know if it is possible for DC2 to be more up-to-date than DC1 with this setup...).

Now imagine DC1 loses connectivity and the client fails over to DC2. Everything should work fine after that, with the only difference being that DC2 will now be handling the requests directly from the client. After some time, say after max_hint_window_in_ms, DC1 comes back up. My question is: how do I bring DC1 up to speed with DC2, which is now more up-to-date? Will that require a nodetool repair on DC1 nodes? Also, what is the answer when the outage is < max_hint_window_in_ms instead?

Thanks in advance!
Vasilis

--
Kind Regards,
Vasileios Vlachos
Multi-DC Environment Question
Hello All,

We have plans to add a second DC to our live Cassandra environment. Currently RF=3 and we read and write at QUORUM. After adding DC2 we are going to be reading and writing at LOCAL_QUORUM. If my understanding is correct, when a client sends a write request and the consistency level is satisfied on DC1 (that is, RF/2+1 replicas), success is returned to the client and DC2 will eventually get the data as well. The assumption behind this is that the client always connects to DC1 for reads and writes, and that there is a site-to-site VPN between DC1 and DC2. Therefore, DC1 will almost always return success before DC2 (actually I don't know if it is possible for DC2 to be more up-to-date than DC1 with this setup...).

Now imagine DC1 loses connectivity and the client fails over to DC2. Everything should work fine after that, with the only difference being that DC2 will now be handling the requests directly from the client. After some time, say after max_hint_window_in_ms, DC1 comes back up. My question is: how do I bring DC1 up to speed with DC2, which is now more up-to-date? Will that require a nodetool repair on DC1 nodes? Also, what is the answer when the outage is < max_hint_window_in_ms instead?

Thanks in advance!
Vasilis

--
Kind Regards,
Vasileios Vlachos
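(For context, the kind of keyspace definition this setup implies once DC2 is added: NetworkTopologyStrategy with RF=3 in each DC. 'myks' and the DC names are hypothetical placeholders; the DC names must match whatever your snitch reports:)

cat > /tmp/multidc.cql <<'EOF'
-- RF=3 per DC means LOCAL_QUORUM needs 2 replicas in the local DC,
-- so a full remote-DC outage does not block local reads or writes.
CREATE KEYSPACE myks WITH replication =
    {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};
EOF
cqlsh -f /tmp/multidc.cql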
Re: Multi-DC Environment Question
When one node or DC is down, coordinator nodes being written through will notice this fact and store hints (hinted handoff is the mechanism), and those hints are used to send the data that was not able to be replicated initially. http://www.datastax.com/dev/blog/modern-hinted-handoff

-Tupshin

On May 29, 2014 6:22 PM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

Hello All,

We have plans to add a second DC to our live Cassandra environment. Currently RF=3 and we read and write at QUORUM. After adding DC2 we are going to be reading and writing at LOCAL_QUORUM. If my understanding is correct, when a client sends a write request and the consistency level is satisfied on DC1 (that is, RF/2+1 replicas), success is returned to the client and DC2 will eventually get the data as well. The assumption behind this is that the client always connects to DC1 for reads and writes, and that there is a site-to-site VPN between DC1 and DC2. Therefore, DC1 will almost always return success before DC2 (actually I don't know if it is possible for DC2 to be more up-to-date than DC1 with this setup...).

Now imagine DC1 loses connectivity and the client fails over to DC2. Everything should work fine after that, with the only difference being that DC2 will now be handling the requests directly from the client. After some time, say after max_hint_window_in_ms, DC1 comes back up. My question is: how do I bring DC1 up to speed with DC2, which is now more up-to-date? Will that require a nodetool repair on DC1 nodes? Also, what is the answer when the outage is < max_hint_window_in_ms instead?

Thanks in advance!
Vasilis

--
Kind Regards,
Vasileios Vlachos
Re: Multi-DC Environment Question
Short answer: If time elapsed > max_hint_window_in_ms then hints will stop being created. You will need to rely on your read consistency level, read repair and anti-entropy repair operations to restore consistency.

Long answer: http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359

On 30 May 2014, at 8:40 am, Tupshin Harper tups...@tupshin.com wrote:

When one node or DC is down, coordinator nodes being written through will notice this fact and store hints (hinted handoff is the mechanism), and those hints are used to send the data that was not able to be replicated initially. http://www.datastax.com/dev/blog/modern-hinted-handoff

-Tupshin

On May 29, 2014 6:22 PM, Vasileios Vlachos vasileiosvlac...@gmail.com wrote:

Hello All,

We have plans to add a second DC to our live Cassandra environment. Currently RF=3 and we read and write at QUORUM. After adding DC2 we are going to be reading and writing at LOCAL_QUORUM. If my understanding is correct, when a client sends a write request and the consistency level is satisfied on DC1 (that is, RF/2+1 replicas), success is returned to the client and DC2 will eventually get the data as well. The assumption behind this is that the client always connects to DC1 for reads and writes, and that there is a site-to-site VPN between DC1 and DC2. Therefore, DC1 will almost always return success before DC2 (actually I don't know if it is possible for DC2 to be more up-to-date than DC1 with this setup...).

Now imagine DC1 loses connectivity and the client fails over to DC2. Everything should work fine after that, with the only difference being that DC2 will now be handling the requests directly from the client. After some time, say after max_hint_window_in_ms, DC1 comes back up. My question is: how do I bring DC1 up to speed with DC2, which is now more up-to-date? Will that require a nodetool repair on DC1 nodes? Also, what is the answer when the outage is < max_hint_window_in_ms instead?

Thanks in advance!
Vasilis

--
Kind Regards,
Vasileios Vlachos
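(The window Ben refers to is the max_hint_window_in_ms setting in cassandra.yaml; a quick sketch of checking and raising it, where the file path is an assumption for packaged installs and the 6-hour value is just an example:)

# Check the current value on each node (path is distro-dependent):
grep max_hint_window_in_ms /etc/cassandra/cassandra.yaml

# To widen the window during which hints are still created, set e.g.:
#   max_hint_window_in_ms: 21600000    # 6 hours (1.2 default is 10800000, 3h)
# then restart each node for the change to take effect.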