Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "Operations" page has been changed by JonathanEllis.
The comment on this change is: add NTS notes to token selection.
http://wiki.apache.org/cassandra/Operations?action=diff&rev1=89&rev2=90

--------------------------------------------------

  === Token selection ===
  Using a strong hash function means !RandomPartitioner keys will, on average, 
be evenly spread across the Token space, but you can still have imbalances if 
your Tokens do not divide up the range evenly, so you should specify an 
!InitialToken for each of your first N nodes as `i * (2**127 / N)` for 
i = 0 .. N-1. In Cassandra 0.7, you should specify `initial_token` in 
`cassandra.yaml`.
  
+ With !NetworkTopologyStrategy, you should calculate the tokens for the nodes 
in each DC independently. Tokens still need to be unique, so you can add 1 to 
the tokens in the 2nd DC, add 2 in the 3rd, and so on.  Thus, for a 4-node 
cluster in 2 datacenters, you would have
+ {{{
+ DC1
+ node 1 = 0
+ node 2 = 85070591730234615865843651857942052864
+ 
+ DC2
+ node 1 = 1
+ node 2 = 85070591730234615865843651857942052865
+ }}}
+ 
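+ As a rough sketch (not from the original page; the `dc_tokens` helper is 
ours), the per-DC tokens above can be computed in python with the same offset 
trick:
+ {{{
+ # evenly spaced tokens per datacenter, offset by the DC index so that
+ # tokens stay unique across the whole cluster
+ def dc_tokens(nodes_per_dc, dc_index):
+     for i in xrange(nodes_per_dc):
+         print 2 ** 127 / nodes_per_dc * i + dc_index
+
+ dc_tokens(2, 0)   # DC1: 0, 85070591730234615865843651857942052864
+ dc_tokens(2, 1)   # DC2: 1, 85070591730234615865843651857942052865
+ }}}
+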
  With order preserving partitioners, your key distribution will be 
application-dependent.  You should still take your best guess at specifying 
initial tokens (guided by sampling actual data, if possible), but you will be 
more dependent on active load balancing (see below) and/or adding new nodes to 
hot spots.
  
  Once data is placed on the cluster, the partitioner may not be changed 
without wiping and starting over.
@@ -40, +51 @@

  Replication factor is not really intended to be changed in a live cluster 
either, but increasing it is conceptually simple: update the replication_factor 
from the CLI (see below), then run repair against each node in your cluster so 
that all the new replicas that are supposed to have the data, actually do.
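+
+ For example (a sketch, not from the original page; host names are 
placeholders), repair can be run serially against each node with a small 
python loop:
+ {{{
+ import subprocess
+
+ # hypothetical host names; replace with your own nodes
+ nodes = ["cass1.example.com", "cass2.example.com", "cass3.example.com"]
+
+ for host in nodes:
+     # nodetool repair blocks until the repair on that node has finished
+     subprocess.check_call(["nodetool", "-h", host, "repair"])
+ }}}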
  
  Until repair is finished, you have 3 options:
+ 
   * read at ConsistencyLevel.QUORUM or ALL (depending on your existing 
replication factor) to make sure that a replica that actually has the data is 
consulted
   * continue reading at lower CL, accepting that some requests will fail 
(usually only the first for a given query, if ReadRepair is enabled)
   * take downtime while repair runs
@@ -49, +61 @@

  Reducing replication factor is easily done and only requires running cleanup 
afterwards to remove extra replicas.
  
  To update the replication factor on a live cluster, forget about 
cassandra.yaml. Rather, you want to use '''cassandra-cli''':
+ 
-     update keyspace Keyspace1 with replication_factor = 3;
+  . update keyspace Keyspace1 with replication_factor = 3;
  
  === Network topology ===
  Besides datacenters, you can also tell Cassandra which nodes are in the same 
rack within a datacenter.  Cassandra will use this to route both reads and data 
movement for Range changes to the nearest replicas.  This is configured by a 
user-pluggable !EndpointSnitch class in the configuration file.
@@ -97, +110 @@

  
  Here's a python program which can be used to calculate new tokens for the 
nodes. There's more info on the subject in Ben Black's presentation at 
Cassandra Summit 2010: 
http://www.datastax.com/blog/slides-and-videos-cassandra-summit-2010
  
-   def tokens(nodes):                        
+  . def tokens(nodes):
-       for x in xrange(nodes):         
+   . for x in xrange(nodes):
-           print 2 ** 127 / nodes * x
+    . print 2 ** 127 / nodes * x
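+
+ For example (an added illustration), calling `tokens(4)` prints:
+ {{{
+ 0
+ 42535295865117307932921825928971026432
+ 85070591730234615865843651857942052864
+ 127605887595351923798765477786913079296
+ }}}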
  
- In versions of Cassandra 0.7.* and lower, there's also `nodetool 
loadbalance`: essentially a convenience over decommission + bootstrap, only 
instead of telling the target node where to move on the ring it will choose its 
location based on the same heuristic as Token selection on bootstrap. You 
should not use this as it doesn't rebalance the entire ring. 
+ In versions of Cassandra 0.7.* and lower, there's also `nodetool 
loadbalance`: essentially a convenience over decommission + bootstrap, only 
instead of telling the target node where to move on the ring it will choose its 
location based on the same heuristic as Token selection on bootstrap. You 
should not use this as it doesn't rebalance the entire ring.
  
- The status of move and balancing operations can be monitored using `nodetool` 
with the `netstat` argument. 
+ The status of move and balancing operations can be monitored using `nodetool` 
with the `netstat` argument.  (Cassandra 0.6.* and lower use the `streams` 
argument).
- (Cassandra 0.6.* and lower use the `streams` argument).
  
  == Consistency ==
  Cassandra allows clients to specify the desired consistency level on reads 
and writes.  (See [[API]].)  If R + W > N, where R, W, and N are respectively 
the read replica count, the write replica count, and the replication factor, 
all client reads will see the most recent write.  Otherwise, readers '''may''' 
see older versions, for periods of typically a few ms; this is called "eventual 
consistency."  See 
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html and 
http://queue.acm.org/detail.cfm?id=1466448 for more.
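+
+ As a worked illustration (a sketch, not from the original page), the overlap 
rule can be checked directly:
+ {{{
+ # True when every set of R read replicas must intersect the W written replicas
+ def sees_latest_write(r, w, n):
+     return r + w > n
+
+ print sees_latest_write(2, 2, 3)  # QUORUM reads and writes at RF=3: True
+ print sees_latest_write(1, 1, 3)  # CL.ONE reads and writes at RF=3: False
+ }}}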
@@ -115, +127 @@

  Cassandra repairs data in two ways:
  
   1. Read Repair: every time a read is performed, Cassandra compares the 
versions at each replica (in the background, if a low consistency was requested 
by the reader to minimize latency), and the newest version is sent to any 
out-of-date replicas.
-  1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle 
tree for each range of data on that node, and compares it with the versions on 
other replicas, to catch any out of sync data that hasn't been read recently.  
This is intended to be run infrequently (e.g., weekly) since computing the 
Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the 
data on the machine (but it is is very network efficient).  
+  1. Anti-Entropy: when `nodetool repair` is run, Cassandra computes a Merkle 
tree for each range of data on that node, and compares it with the versions on 
other replicas, to catch any out-of-sync data that hasn't been read recently.  
This is intended to be run infrequently (e.g., weekly) since computing the 
Merkle tree is relatively expensive in disk i/o and CPU: it scans ALL the data 
on the machine (but it is very network efficient).
  
- Running `nodetool repair`:
- Like all nodetool operations in 0.7, repair is blocking: it will wait for the 
repair to finish and then exit.  This may take a long time on large data sets.
+ Running `nodetool repair`: Like all nodetool operations in 0.7, repair is 
blocking: it will wait for the repair to finish and then exit.  This may take a 
long time on large data sets.
  
  It is safe to run repair against multiple machines at the same time, but to 
minimize the impact on your application workload it is recommended to wait for 
it to complete on one node before invoking it against the next.
  
  === Frequency of nodetool repair ===
- 
- Unless your application performs no deletes, it is vital that production 
clusters run `nodetool repair` periodically on all nodes in the cluster. The 
hard requirement for repair frequency is the value used for GCGraceSeconds (see 
[[DistributedDeletes]]). Running nodetool repair often enough to guarantee that 
all nodes have performed a repair in a given period GCGraceSeconds long, 
ensures that deletes are not "forgotten" in the cluster.
+ Unless your application performs no deletes, it is vital that production 
clusters run `nodetool repair` periodically on all nodes in the cluster. The 
hard requirement for repair frequency is the value used for GCGraceSeconds (see 
DistributedDeletes). Running `nodetool repair` often enough to guarantee that 
all nodes have performed a repair in a given period GCGraceSeconds long ensures 
that deletes are not "forgotten" in the cluster.
  
  Consider how to schedule your repairs. A repair causes additional disk and 
CPU activity on the nodes participating in the repair, and it will typically be 
a good idea to spread repairs out over time so as to minimize the chances of 
repairs running concurrently on many nodes.
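+
+ As a rough sketch (not from the original page; node names are placeholders 
and the 10-day default GCGraceSeconds is assumed), repairs can be staggered 
evenly across the GCGraceSeconds window:
+ {{{
+ gc_grace_seconds = 10 * 24 * 3600              # default GCGraceSeconds
+ nodes = ["cass1", "cass2", "cass3", "cass4"]   # hypothetical node names
+
+ # start each node's repair at a different offset so repairs do not pile up,
+ # while still guaranteeing every node is repaired within the window
+ interval = gc_grace_seconds / len(nodes)
+ for i, node in enumerate(nodes):
+     print "repair %s at t + %d seconds" % (node, i * interval)
+ }}}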
  
  ==== Dealing with the consequences of nodetool repair not running within 
GCGraceSeconds ====
- 
- If `nodetool repair` has not been run often enough to the pointthat 
GCGraceSeconds has passed, you risk forgotten deletes (see 
[[DistributedDeletes]]). In addition to data popping up that has been deleted, 
you may see inconsistencies in data return from different nodes that will not 
self-heal by read-repair or further `nodetool repair`. Some further details on 
this latter effect is documented in 
[[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].
+ If `nodetool repair` has not been run often enough to the point that 
GCGraceSeconds has passed, you risk forgotten deletes (see DistributedDeletes). 
In addition to deleted data popping back up, you may see inconsistencies in 
data returned from different nodes that will not self-heal by read-repair or 
further `nodetool repair`. Some further details on this latter effect are 
documented in 
[[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].
  
  There are at least three ways to deal with this scenario.
  
   1. Treat the node in question as failed, and replace it as described further 
below.
-  2. To minimize the amount of forgotten deletes, first increase 
GCGraceSeconds across the cluster (rolling restart required), perform a full 
repair on all nodes, and then change GCRaceSeconds back again. This has the 
advantage of ensuring tombstones spread as much as possible, minimizing the 
amount of data that may "pop back up" (forgotten delete).
+  1. To minimize the number of forgotten deletes, first increase 
GCGraceSeconds across the cluster (rolling restart required), perform a full 
repair on all nodes, and then change GCGraceSeconds back again. This has the 
advantage of ensuring tombstones spread as much as possible, minimizing the 
amount of data that may "pop back up" (forgotten delete).
-  3. Yet another option, that will result in more forgotten deletes than the 
previous suggestion but is easier to do, is to ensure 'nodetool repair' has 
been run on all nodes, and then perform a compaction to expire toombstones. 
Following this, read-repair and regular `nodetool repair` should cause the 
cluster to converge.
+  1. Yet another option, which will result in more forgotten deletes than the 
previous suggestion but is easier to do, is to ensure `nodetool repair` has 
been run on all nodes, and then perform a compaction to expire tombstones. 
Following this, read-repair and regular `nodetool repair` should cause the 
cluster to converge.
  
  === Handling failure ===
  If a node goes down and comes back up, the ordinary repair mechanisms will be 
adequate to deal with any inconsistent data.  Remember though that if a node 
misses updates and is not repaired for longer than your configured 
GCGraceSeconds (default: 10 days), it could have missed remove operations 
permanently.  Unless your application performs no removes, you should wipe its 
data directory, re-bootstrap it, and removetoken its old entry in the ring (see 
below).
@@ -158, +167 @@

  {{{
  Exception in thread "main" java.io.IOException: Cannot run program "ln": 
java.io.IOException: error=12, Cannot allocate memory
  }}}
- This is caused by the operating system trying to allocate the child "ln" 
process a memory space as large as the parent process (the cassandra server), 
even though '''it's not going to use it'''. So if you have a machine with 8GB 
of RAM and no swap, and you gave 6GB to the cassandra server, it will fail 
during this because the operating system wants 12 GB of virtual memory before 
allowing you to create the process. 
+ This is caused by the operating system trying to allocate the child "ln" 
process a memory space as large as the parent process (the cassandra server), 
even though '''it's not going to use it'''. So if you have a machine with 8GB 
of RAM and no swap, and you gave 6GB to the cassandra server, snapshotting will 
fail because the operating system wants 12GB of virtual memory before allowing 
you to create the process.
  
  This error can be worked around by either:
  
@@ -167, +176 @@

  OR
  
   * creating a swap file, snapshotting, removing swap file
+ 
  OR
+ 
   * turning on "memory overcommit"
  
  To restore a snapshot:
@@ -210, +221 @@

  
  Running `nodetool cfstats` can provide an overview of each Column Family, 
and important metrics to graph for your cluster. For those who prefer to deal 
with non-JMX clients, there is a JMX-to-REST bridge available at 
http://code.google.com/p/polarrose-jmx-rest-bridge/
  
- Important metrics to watch on a per-Column Family basis would be: '''Read 
Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell 
you if things are backing up. These metrics can also be exposed using any JMX 
client such as `jconsole`.  (See also 
[[http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html]] for how 
to proxy JConsole to firewalled machines.)
+ Important metrics to watch on a per-Column Family basis would be: '''Read 
Count, Read Latency, Write Count and Write Latency'''. '''Pending Tasks''' tell 
you if things are backing up. These metrics can also be exposed using any JMX 
client such as `jconsole`.  (See also 
http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html for how to 
proxy JConsole to firewalled machines.)
  
  You can also use jconsole and the MBeans tab to look at PendingTasks for 
thread pools. If you see one particular thread backing up, this can give you an 
indication of a problem. One example would be ROW-MUTATION-STAGE indicating 
that write requests are arriving faster than they can be handled. A more subtle 
example is the FLUSH stages: if these start backing up, cassandra is accepting 
writes into memory fast enough, but the sort-and-write-to-disk stages are 
falling behind.
  
@@ -240, +251 @@

  FLUSH-WRITER-POOL                 0         0            218
  HINTED-HANDOFF-POOL               0         0            154
  }}}
- 
  == Monitoring with MX4J ==
- mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0 
cassandra lets you hook up mx4j very easily.
+ mx4j provides an HTML and HTTP interface to JMX. Starting from version 0.7.0 
cassandra lets you hook up mx4j very easily. To enable mx4j on a Cassandra node:
- To enable mx4j on a Cassandra node:
+ 
   * Download mx4j-tools.jar from http://mx4j.sourceforge.net/
   * Add mx4j-tools.jar to the classpath (e.g. under lib/)
   * Start cassandra
-  * In the log you should see a message such as Http``Atapter started on port 
8081
+  * In the log you should see a message such as HttpAdaptor started on port 
8081
   * To choose a different port (8081 is the default) or a different listen 
address (0.0.0.0 is not the default), edit conf/cassandra-env.sh and uncomment 
#MX4J_ADDRESS="-Dmx4jaddress=0.0.0.0" and #MX4J_PORT="-Dmx4jport=8081"
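+
+ After uncommenting, those two lines in conf/cassandra-env.sh read (simply the 
settings quoted above, shown without the leading #):
+ {{{
+ MX4J_ADDRESS="-Dmx4jaddress=0.0.0.0"
+ MX4J_PORT="-Dmx4jport=8081"
+ }}}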
  
  Now browse to http://cassandra:8081/ and use the HTML interface.
