Re: Multi-DC Environment Question

2014-06-16 Thread Vasileios Vlachos
Hello again,

Back to this after a while...

As far as I can tell, whenever DC2 is unavailable there is one node in
DC1 that acts as the coordinator. When DC2 becomes available again, this
node sends the hints to only one node in DC2, which then forwards the
replicas to the other nodes in its local DC (DC2). This keeps cross-DC
bandwidth usage efficient. I was watching system.hints on all nodes during
this test and this is the conclusion I came to.

Two things:
1. If the above is correct, does the same apply when performing an
anti-entropy repair (without specifying a particular DC)? I am hoping the
answer is YES, otherwise the VPN is not going to be very happy in our case,
and we would prefer not to saturate it whenever we run nodetool repair.
Worst case, I suppose we could apply a traffic limiter on the firewalls,
but I would appreciate your input if you know more about this.

2. As I described earlier, in order to test this I was watching the
system.hints CF in order to monitor any hints, and I was looking to add a
Nagios check for the same purpose. For that reason I was looking into the
JMX console. I noticed that when a node stores hints, the MBean
org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=hints,
attribute MemtableColumnsCount goes up (although I would expect it to be
called MemtableRowCount or something?). This attribute retains its value
until the other node becomes available and ready to receive the hints. I
was then looking for another attribute somewhere to monitor the active
hints. I checked:

MBean org.apache.cassandra.metrics:type=ColumnFamily,keyspace=system,scope=hints,name=PendingTasks,
MBean org.apache.cassandra.metrics:type=Storage,name=TotalHints,
MBean org.apache.cassandra.metrics:type=Storage,name=TotalHintsInProgress,
MBean org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=ActiveTasks
and even
MBean org.apache.cassandra.metrics:type=HintedHandOffManager,name=Hints_not_stored-/10.2.1.100
(this one never goes back to zero).

None of these attributes increased while hints were being sent (or at
least I didn't catch it, perhaps because it happened too fast?). Does
anyone know what all these attributes represent? It looks like there are
more specific hint attributes on a per-CF basis, but I was looking for a
more generic one to begin with. Any help would be much appreciated.
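For the Nagios check, one low-tech sketch (not a tested plugin) is to shell
out to nodetool cfstats and parse the memtable column count for the hints
CF out of the text output. The sample output below is abbreviated and
hypothetical; the exact field labels can differ between Cassandra versions,
so treat the parsing as an assumption to verify against your own output:

```python
import re

# Abbreviated, hypothetical `nodetool cfstats` output; the real output
# lists every keyspace and column family with an indented stats block.
SAMPLE = """\
Keyspace: system
\tColumn Family: hints
\t\tMemtable Columns Count: 42
\t\tMemtable Data Size: 1234
\tColumn Family: schema_columns
\t\tMemtable Columns Count: 0
"""

def hints_memtable_columns(cfstats_text):
    """Return the Memtable Columns Count for the hints CF, or None."""
    in_hints = False
    for line in cfstats_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Column Family:"):
            # Track whether we are inside the hints CF stats block.
            in_hints = stripped.endswith(" hints")
        elif in_hints:
            m = re.match(r"Memtable Columns Count:\s+(\d+)", stripped)
            if m:
                return int(m.group(1))
    return None

if __name__ == "__main__":
    count = hints_memtable_columns(SAMPLE)
    print(count)  # 42 for the sample above
    # A Nagios plugin would exit WARNING/CRITICAL when count exceeds
    # a threshold, i.e. when hints are accumulating.
```

In a real check you would feed it the output of `subprocess` running
nodetool against each node rather than the inline sample.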

Thanks in advance,

Vasilis



Re: Multi-DC Environment Question

2014-06-04 Thread Vasileios Vlachos
Hello Matt,

nodetool status:

Datacenter: MAN
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 10.2.1.103 89.34 KB 99.2% b7f8bc93-bf39-475c-a251-8fbe2c7f7239
-9211685935328163899 RAC1
UN 10.2.1.102 86.32 KB 0.7% 1f8937e1-9ecb-4e59-896e-6d6ac42dc16d
-3511707179720619260 RAC1
Datacenter: DER
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 10.2.1.101 75.43 KB 0.2% e71c7ee7-d852-4819-81c0-e993ca87dd5c
-1277931707251349874 RAC1
UN 10.2.1.100 104.53 KB 99.8% 7333b664-ce2d-40cf-986f-d4b4d4023726
-9204412570946850701 RAC1

I do not know why the cluster is not balanced at the moment, but it holds
almost no data. I will populate it soon and see how that goes. The output
of 'nodetool ring' just lists all the tokens assigned to each individual
node, and as you can imagine it would be pointless to paste it here. I just
ran 'nodetool ring | awk ... | uniq | wc -l' and it works out to be 1024
as expected (4 nodes x 256 tokens each).

Still have not got the answers to the other questions though...

Thanks,

Vasilis


On Wed, Jun 4, 2014 at 12:28 AM, Matthew Allen matthew.j.al...@gmail.com
wrote:

 Thanks Vasileios.  I think I need to make a call as to whether to switch
 to vnodes or stick with tokens for my Multi-DC cluster.

 Would you be able to show a nodetool ring/status from your cluster to see
 what the token assignment looks like ?

 Thanks

 Matt







Re: Multi-DC Environment Question

2014-06-03 Thread Robert Coli
On Fri, May 30, 2014 at 4:08 AM, Vasileios Vlachos 
vasileiosvlac...@gmail.com wrote:

 Basically you sort of confirmed that if down_time > max_hint_window_in_ms
 the only way to bring DC1 up-to-date is anti-entropy repair.


Also, read repair does not help either as we assumed that down_time >
 max_hint_window_in_ms. Please correct me if I am wrong.


My understanding is that if you :

1) set read repair chance to 100%
2) read all keys in the keyspace with a client

You would accomplish the same increase in consistency as you would by
running repair.
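In 1.2-era CQL3, step 1 would look something along these lines (the
keyspace and table names are placeholders). Note there is also a
dclocal_read_repair_chance table property if you only want the repair
traffic to stay within the local DC:

```sql
-- Force read repair on every read (table name hypothetical).
ALTER TABLE my_keyspace.my_table WITH read_repair_chance = 1.0;
```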

In cases where this may matter, and your system can handle delivering the
hints, increasing the current default of 3 hours (itself already increased
from the old default of 1 hour) to 6 or more hours gives operators more
time to work in the case of a partition or failure. Note that hints are
only an optimization; only repair (and read repair at 100%, I think..)
asserts any guarantee of consistency.
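If you go down that road, the setting lives in cassandra.yaml on every
node; the value is in milliseconds, with 6 hours shown here per the
suggestion above:

```yaml
# cassandra.yaml: extend the hint window from the 3 hour default (10800000)
max_hint_window_in_ms: 21600000  # 6 hours
```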

=Rob


Re: Multi-DC Environment Question

2014-06-03 Thread Vasileios Vlachos

Thanks for your responses!

Matt, I did a test with 4 nodes, 2 in each DC and the answer appears to 
be yes. The tokens seem to be unique across the entire cluster, not just 
on a per DC basis. I don't know if the number of nodes deployed is 
enough to reassure me, but this is my conclusion for now. Please, 
correct me if you know I'm wrong.


Rob, this is the plan of attack I have in mind now. However, in the case of 
a catastrophic failure of a DC, the downtime is usually longer than 
that. So it's either less than the default value (when testing that the DR 
works, for example) or more (when actually using the DR as the primary DC). 
Based on that, the default seems reasonable to me.


I also found that nodetool repair can be performed on one DC only by 
specifying the --in-local-dc option. So, presumably the classic nodetool 
repair applies to the entire cluster (sounds obvious, but is that 
actually correct?).


Question 3 in my previous email still remains unanswered... I cannot 
find out whether only one hint is stored on the coordinator irrespective of 
the number of replicas being down, and also whether the hint is 100% of the 
size of the original write request.


Thanks,

Vasilis




--
Kind Regards,

Vasileios Vlachos



Re: Multi-DC Environment Question

2014-06-03 Thread Vasileios Vlachos
I should have said that earlier really... I am using 1.2.16 and Vnodes 
are enabled.


Thanks,

Vasilis

--
Kind Regards,

Vasileios Vlachos



Re: Multi-DC Environment Question

2014-06-02 Thread Matthew Allen
Hi Vasilis,

With regards to Question 2.

"How are tokens assigned when adding a 2nd DC? Is the range -2^63 to
2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? (I
think the latter is correct)"

Have you been able to deduce an answer to this (assuming the
Murmur3Partitioner)?

I'm facing the same problem and am leaning towards the latter (-2^63 to
2^63-1 for the entire cluster), given that nodetool repair isn't multi-DC
aware (https://issues.apache.org/jira/browse/CASSANDRA-2609).

Unfortunately the documentation
(http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html)
isn't quite clear ("Multiple data center deployments: calculate the tokens
for each data center so that the hash range is evenly divided for the nodes
in each data center.")
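For what it's worth, with the Murmur3Partitioner the hash range is -2^63 to
2^63-1 for the whole cluster (there is one ring), and the usual
manual-token recipe for multi-DC is to divide that range evenly within each
DC and then shift the second DC's tokens by a small constant so no two
nodes collide. A quick sketch of that arithmetic (the +100 offset is just
the conventional example from the docs):

```python
def dc_tokens(num_nodes, offset=0):
    """Evenly spaced Murmur3 initial tokens for one DC, shifted by offset."""
    return [(-2**63 + i * (2**64 // num_nodes)) + offset
            for i in range(num_nodes)]

dc1 = dc_tokens(2)              # first datacenter, 2 nodes
dc2 = dc_tokens(2, offset=100)  # second DC: same spacing, offset avoids clashes

print(dc1)  # [-9223372036854775808, 0]
print(dc2)  # [-9223372036854775708, 100]

# Tokens must be unique across the *entire* cluster -- both DCs share one ring.
assert len(set(dc1 + dc2)) == 4
```

With vnodes (num_tokens: 256) the tokens are picked randomly per node
instead, but the same single-ring, cluster-wide-uniqueness property holds.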

Thanks

Matt







Re: Multi-DC Environment Question

2014-05-30 Thread Vasileios Vlachos
Thanks for your responses, Ben thanks for the link.

Basically you sort of confirmed that if down_time > max_hint_window_in_ms
the only way to bring DC1 up-to-date is anti-entropy repair. Read
consistency level is irrelevant to the problem I described as I am reading
at LOCAL_QUORUM. In this situation I lost whatever data -if any- had not
been transferred across to DC2 before DC1 went down, that is
understandable. Also, read repair does not help either as we assumed that
down_time > max_hint_window_in_ms. Please correct me if I am wrong.

I think I could better understand how that works if I knew the answers to
the following questions:
1. What is the output of nodetool status when a cluster spans 2 DCs? Will
I be able to see ALL nodes irrespective of the DC they belong to?
2. How are tokens assigned when adding a 2nd DC? Is the range -2^63 to
2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? (I
think the latter is correct)
3. Does the coordinator store 1 hint irrespective of how many replicas
happen to be down at the time, and also irrespective of DC2 being down in
the scenario I described above? (I think the answer is yes according to the
presentation you sent me, but I would like someone to confirm that)

Thank you in advance,

Vasilis




Multi-DC Environment Question

2014-05-29 Thread Vasileios Vlachos

Hello All,

We have plans to add a second DC to our live Cassandra environment. 
Currently RF=3 and we read and write at QUORUM. After adding DC2 we are 
going to be reading and writing at LOCAL_QUORUM.


If my understanding is correct, when a client sends a write request, if 
the consistency level is satisfied on DC1 (that is RF/2+1), success is 
returned to the client and DC2 will eventually get the data as well. The 
assumption behind this is that the client always connects to DC1 for 
reads and writes, and that there is a site-to-site VPN between DC1 
and DC2. Therefore, DC1 will almost always return success before DC2 
(actually I don't know if it is possible for DC2 to be more up-to-date 
than DC1 with this setup...).
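As a sanity check on the RF/2+1 arithmetic: with RF=3, QUORUM needs 2
acknowledgements; once a second DC is added (assuming RF=3 per DC, so 6
replicas in total), a plain QUORUM needs 4 acknowledgements across both
DCs, while LOCAL_QUORUM still needs only 2 in the local DC. A tiny sketch:

```python
def quorum(replicas):
    """Replicas that must acknowledge for a quorum: floor(N/2) + 1."""
    return replicas // 2 + 1

rf_per_dc = 3

print(quorum(rf_per_dc))      # LOCAL_QUORUM with RF=3 per DC -> 2
print(quorum(rf_per_dc * 2))  # QUORUM across both DCs (6 replicas) -> 4
```

This is why LOCAL_QUORUM keeps latency local: the coordinator never has to
wait for a cross-VPN acknowledgement to satisfy the consistency level.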


Now imagine DC1 loses connectivity and the client fails over to DC2. 
Everything should work fine after that, with the only difference that 
DC2 will now be handling the requests directly from the client. After 
some time, say after max_hint_window_in_ms, DC1 comes back up. My 
question is how do I bring DC1 up to speed with DC2, which is now more 
up-to-date? Will that require a nodetool repair on DC1 nodes? Also, what 
is the answer when the outage is < max_hint_window_in_ms instead?


Thanks in advance!

Vasilis

--
Kind Regards,

Vasileios Vlachos



Re: Multi-DC Environment Question

2014-05-29 Thread Tupshin Harper
When one node or DC is down, coordinator nodes being written through will
notice this fact and store hints (hinted handoff is the mechanism), and
those hints are used to send the data that could not be replicated
initially.

http://www.datastax.com/dev/blog/modern-hinted-handoff

-Tupshin


Re: Multi-DC Environment Question

2014-05-29 Thread Ben Bromhead
Short answer:

If time elapsed > max_hint_window_in_ms, hints will stop being created. You 
will need to rely on your read consistency level, read repair and anti-entropy 
repair operations to restore consistency.

Long answer:

http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359
