http://git-wip-us.apache.org/repos/asf/cassandra/blob/54f7335c/doc/source/operating/metrics.rst ---------------------------------------------------------------------- diff --git a/doc/source/operating/metrics.rst b/doc/source/operating/metrics.rst new file mode 100644 index 0000000..5884cad --- /dev/null +++ b/doc/source/operating/metrics.rst @@ -0,0 +1,619 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +.. highlight:: none + +Monitoring +---------- + +Metrics in Cassandra are managed using the `Dropwizard Metrics <http://metrics.dropwizard.io>`__ library. These metrics +can be queried via JMX or pushed to external monitoring systems using a number of `built in +<http://metrics.dropwizard.io/3.1.0/getting-started/#other-reporting>`__ and `third party +<http://metrics.dropwizard.io/3.1.0/manual/third-party/>`__ reporter plugins. + +Metrics are collected for a single node. It's up to the operator to use an external monitoring system to aggregate them. + +Metric Types +^^^^^^^^^^^^ +All metrics reported by cassandra fit into one of the following types. + +``Gauge`` + An instantaneous measurement of a value. + +``Counter`` + A gauge for an ``AtomicLong`` instance. 
 Typically this is consumed by monitoring the change since the last call to + see if there is a large increase compared to the norm. + +``Histogram`` + Measures the statistical distribution of values in a stream of data. + + In addition to minimum, maximum, mean, etc., it also measures median, 75th, 90th, 95th, 98th, 99th, and 99.9th + percentiles. + +``Timer`` + Measures both the rate at which a particular piece of code is called and the histogram of its duration. + +``Latency`` + Special type that tracks latency (in microseconds) with a ``Timer`` plus a ``Counter`` that tracks the total latency + accrued since starting. The latter is useful if you track the change in total latency since the last check. Each + metric name of this type will have 'Latency' and 'TotalLatency' appended to it. + +``Meter`` + Measures mean throughput and one-, five-, and fifteen-minute exponentially-weighted moving + average throughputs. + +Table Metrics +^^^^^^^^^^^^^ + +Each table in Cassandra has metrics responsible for tracking its state and performance. + +The metric names all have the specific ``Keyspace`` and ``Table`` name appended. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.Table.{{MetricName}}.{{Keyspace}}.{{Table}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=Table keyspace={{Keyspace}} scope={{Table}} name={{MetricName}}`` + +.. NOTE:: + There is a special table called '``all``' without a keyspace. This represents the aggregation of metrics across + **all** tables and keyspaces on the node. + + +======================================= ============== =========== +Name Type Description +======================================= ============== =========== +MemtableOnHeapSize Gauge<Long> Total amount of data stored in the memtable that resides **on**-heap, including column related overhead and partitions overwritten. 
+MemtableOffHeapSize Gauge<Long> Total amount of data stored in the memtable that resides **off**-heap, including column related overhead and partitions overwritten. +MemtableLiveDataSize Gauge<Long> Total amount of live data stored in the memtable, excluding any data structure overhead. +AllMemtablesOnHeapSize Gauge<Long> Total amount of data stored in the memtables (2i and pending flush memtables included) that resides **on**-heap. +AllMemtablesOffHeapSize Gauge<Long> Total amount of data stored in the memtables (2i and pending flush memtables included) that resides **off**-heap. +AllMemtablesLiveDataSize Gauge<Long> Total amount of live data stored in the memtables (2i and pending flush memtables included) that resides off-heap, excluding any data structure overhead. +MemtableColumnsCount Gauge<Long> Total number of columns present in the memtable. +MemtableSwitchCount Counter Number of times flush has resulted in the memtable being switched out. +CompressionRatio Gauge<Double> Current compression ratio for all SSTables. +EstimatedPartitionSizeHistogram Gauge<long[]> Histogram of estimated partition size (in bytes). +EstimatedPartitionCount Gauge<Long> Approximate number of keys in table. +EstimatedColumnCountHistogram Gauge<long[]> Histogram of estimated number of columns. +SSTablesPerReadHistogram Histogram Histogram of the number of sstable data files accessed per read. +ReadLatency Latency Local read latency for this table. +RangeLatency Latency Local range scan latency for this table. +WriteLatency Latency Local write latency for this table. +CoordinatorReadLatency Timer Coordinator read latency for this table. +CoordinatorScanLatency Timer Coordinator range scan latency for this table. +PendingFlushes Counter Estimated number of flush tasks pending for this table. +BytesFlushed Counter Total number of bytes flushed since server [re]start. +CompactionBytesWritten Counter Total number of bytes written by compaction since server [re]start. 
+PendingCompactions Gauge<Integer> Estimate of number of pending compactions for this table. +LiveSSTableCount Gauge<Integer> Number of SSTables on disk for this table. +LiveDiskSpaceUsed Counter Disk space used by SSTables belonging to this table (in bytes). +TotalDiskSpaceUsed Counter Total disk space used by SSTables belonging to this table, including obsolete ones waiting to be GC'd. +MinPartitionSize Gauge<Long> Size of the smallest compacted partition (in bytes). +MaxPartitionSize Gauge<Long> Size of the largest compacted partition (in bytes). +MeanPartitionSize Gauge<Long> Size of the average compacted partition (in bytes). +BloomFilterFalsePositives Gauge<Long> Number of false positives on table's bloom filter. +BloomFilterFalseRatio Gauge<Double> False positive ratio of table's bloom filter. +BloomFilterDiskSpaceUsed Gauge<Long> Disk space used by bloom filter (in bytes). +BloomFilterOffHeapMemoryUsed Gauge<Long> Off-heap memory used by bloom filter. +IndexSummaryOffHeapMemoryUsed Gauge<Long> Off-heap memory used by index summary. +CompressionMetadataOffHeapMemoryUsed Gauge<Long> Off-heap memory used by compression meta data. +KeyCacheHitRate Gauge<Double> Key cache hit rate for this table. +TombstoneScannedHistogram Histogram Histogram of tombstones scanned in queries on this table. +LiveScannedHistogram Histogram Histogram of live cells scanned in queries on this table. +ColUpdateTimeDeltaHistogram Histogram Histogram of column update time delta on this table. +ViewLockAcquireTime Timer Time taken acquiring a partition lock for materialized view updates on this table. +ViewReadTime Timer Time taken during the local read of a materialized view update. +TrueSnapshotsSize Gauge<Long> Disk space used by snapshots of this table including all SSTable components. +RowCacheHitOutOfRange Counter Number of table row cache hits that do not satisfy the query filter, thus went to disk. +RowCacheHit Counter Number of table row cache hits. 
+RowCacheMiss Counter Number of table row cache misses. +CasPrepare Latency Latency of paxos prepare round. +CasPropose Latency Latency of paxos propose round. +CasCommit Latency Latency of paxos commit round. +PercentRepaired Gauge<Double> Percent of table data that is repaired on disk. +SpeculativeRetries Counter Number of times speculative retries were sent for this table. +WaitingOnFreeMemtableSpace Histogram Histogram of time spent waiting for free memtable space, either on- or off-heap. +DroppedMutations Counter Number of dropped mutations on this table. +======================================= ============== =========== + +Keyspace Metrics +^^^^^^^^^^^^^^^^ +Each keyspace in Cassandra has metrics responsible for tracking its state and performance. + +These metrics are the same as the ``Table Metrics`` above, only they are aggregated at the Keyspace level. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.keyspace.{{MetricName}}.{{Keyspace}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=Keyspace scope={{Keyspace}} name={{MetricName}}`` + +ThreadPool Metrics +^^^^^^^^^^^^^^^^^^ + +Cassandra splits work of a particular type into its own thread pool. This provides back-pressure and asynchrony for +requests on a node. It's important to monitor the state of these thread pools since they can tell you how saturated a +node is. + +The metric names are all appended with the specific ``ThreadPool`` name. The thread pools are also categorized under a +specific type. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.ThreadPools.{{MetricName}}.{{Path}}.{{ThreadPoolName}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=ThreadPools scope={{ThreadPoolName}} type={{Type}} name={{MetricName}}`` + +===================== ============== =========== +Name Type Description +===================== ============== =========== +ActiveTasks Gauge<Integer> Number of tasks being actively worked on by this pool. 
+PendingTasks Gauge<Integer> Number of tasks queued up on this pool. +CompletedTasks Counter Number of tasks completed. +TotalBlockedTasks Counter Number of tasks that were blocked due to queue saturation. +CurrentlyBlockedTask Counter Number of tasks that are currently blocked due to queue saturation but on retry will become unblocked. +MaxPoolSize Gauge<Integer> The maximum number of threads in this pool. +===================== ============== =========== + +The following thread pools can be monitored. + +============================ ============== =========== +Name Type Description +============================ ============== =========== +Native-Transport-Requests transport Handles client CQL requests +CounterMutationStage request Responsible for counter writes +ViewMutationStage request Responsible for materialized view writes +MutationStage request Responsible for all other writes +ReadRepairStage request ReadRepair happens on this thread pool +ReadStage request Local reads run on this thread pool +RequestResponseStage request Coordinator requests to the cluster run on this thread pool +AntiEntropyStage internal Builds merkle tree for repairs +CacheCleanupExecutor internal Cache maintenance performed on this thread pool +CompactionExecutor internal Compactions are run on these threads +GossipStage internal Handles gossip requests +HintsDispatcher internal Performs hinted handoff +InternalResponseStage internal Responsible for intra-cluster callbacks +MemtableFlushWriter internal Writes memtables to disk +MemtablePostFlush internal Cleans up commit log after memtable is written to disk +MemtableReclaimMemory internal Memtable recycling +MigrationStage internal Runs schema migrations +MiscStage internal Miscellaneous tasks run here +PendingRangeCalculator internal Calculates token range +PerDiskMemtableFlushWriter_0 internal Responsible for writing a spec (there is one of these per disk 0-N) +Sampler internal Responsible for re-sampling the index summaries 
 of SSTables +SecondaryIndexManagement internal Performs updates to secondary indexes +ValidationExecutor internal Performs validation compaction or scrubbing +============================ ============== =========== + +.. |nbsp| unicode:: 0xA0 .. nonbreaking space + +Client Request Metrics +^^^^^^^^^^^^^^^^^^^^^^ + +Client requests have their own set of metrics that encapsulate the work happening at coordinator level. + +Different types of client requests are broken down by ``RequestType``. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.ClientRequest.{{MetricName}}.{{RequestType}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=ClientRequest scope={{RequestType}} name={{MetricName}}`` + + +:RequestType: CASRead +:Description: Metrics related to transactional read requests. +:Metrics: + ===================== ============== ============================================================= + Name Type Description + ===================== ============== ============================================================= + Timeouts Counter Number of timeouts encountered. + Failures Counter Number of transaction failures encountered. + |nbsp| Latency Transaction read latency. + Unavailables Counter Number of unavailable exceptions encountered. + UnfinishedCommit Counter Number of transactions that were committed on read. + ConditionNotMet Counter Number of transaction preconditions that did not match current values. + ContentionHistogram Histogram How many contended reads were encountered. + ===================== ============== ============================================================= + +:RequestType: CASWrite +:Description: Metrics related to transactional write requests. 
+:Metrics: + ===================== ============== ============================================================= + Name Type Description + ===================== ============== ============================================================= + Timeouts Counter Number of timeouts encountered. + Failures Counter Number of transaction failures encountered. + |nbsp| Latency Transaction write latency. + UnfinishedCommit Counter Number of transactions that were committed on write. + ConditionNotMet Counter Number of transaction preconditions that did not match current values. + ContentionHistogram Histogram How many contended writes were encountered. + ===================== ============== ============================================================= + + +:RequestType: Read +:Description: Metrics related to standard read requests. +:Metrics: + ===================== ============== ============================================================= + Name Type Description + ===================== ============== ============================================================= + Timeouts Counter Number of timeouts encountered. + Failures Counter Number of read failures encountered. + |nbsp| Latency Read latency. + Unavailables Counter Number of unavailable exceptions encountered. + ===================== ============== ============================================================= + +:RequestType: RangeSlice +:Description: Metrics related to token range read requests. +:Metrics: + ===================== ============== ============================================================= + Name Type Description + ===================== ============== ============================================================= + Timeouts Counter Number of timeouts encountered. + Failures Counter Number of range query failures encountered. + |nbsp| Latency Range query latency. + Unavailables Counter Number of unavailable exceptions encountered. 
+ ===================== ============== ============================================================= + +:RequestType: Write +:Description: Metrics related to regular write requests. +:Metrics: + ===================== ============== ============================================================= + Name Type Description + ===================== ============== ============================================================= + Timeouts Counter Number of timeouts encountered. + Failures Counter Number of write failures encountered. + |nbsp| Latency Write latency. + Unavailables Counter Number of unavailable exceptions encountered. + ===================== ============== ============================================================= + + +:RequestType: ViewWrite +:Description: Metrics related to materialized view writes. +:Metrics: + ===================== ============== ============================================================= + Name Type Description + ===================== ============== ============================================================= + Timeouts Counter Number of timeouts encountered. + Failures Counter Number of transaction failures encountered. + Unavailables Counter Number of unavailable exceptions encountered. + ViewReplicasAttempted Counter Total number of attempted view replica writes. + ViewReplicasSuccess Counter Total number of succeeded view replica writes. + ViewPendingMutations Gauge<Long> ViewReplicasAttempted - ViewReplicasSuccess. + ViewWriteLatency Timer Time between when mutation is applied to base table and when CL.ONE is achieved on view. + ===================== ============== ============================================================= + +Cache Metrics +^^^^^^^^^^^^^ + +Cassandra caches have metrics to track the effectiveness of the caches, though the ``Table Metrics`` might be more useful. 
+ +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.Cache.{{MetricName}}.{{CacheName}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=Cache scope={{CacheName}} name={{MetricName}}`` + +========================== ============== =========== +Name Type Description +========================== ============== =========== +Capacity Gauge<Long> Cache capacity in bytes. +Entries Gauge<Integer> Total number of cache entries. +FifteenMinuteCacheHitRate Gauge<Double> 15m cache hit rate. +FiveMinuteCacheHitRate Gauge<Double> 5m cache hit rate. +OneMinuteCacheHitRate Gauge<Double> 1m cache hit rate. +HitRate Gauge<Double> All time cache hit rate. +Hits Meter Total number of cache hits. +Misses Meter Total number of cache misses. +MissLatency Timer Latency of misses. +Requests Gauge<Long> Total number of cache requests. +Size Gauge<Long> Total size of occupied cache, in bytes. +========================== ============== =========== + +The following caches are covered: + +============================ =========== +Name Description +============================ =========== +CounterCache Keeps hot counters in memory for performance. +ChunkCache In process uncompressed page cache. +KeyCache Cache for partition to sstable offsets. +RowCache Cache for rows kept in memory. +============================ =========== + +.. NOTE:: + Misses and MissLatency are only defined for the ChunkCache + +CQL Metrics +^^^^^^^^^^^ + +Metrics specific to CQL prepared statement caching. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.CQL.{{MetricName}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=CQL name={{MetricName}}`` + +========================== ============== =========== +Name Type Description +========================== ============== =========== +PreparedStatementsCount Gauge<Integer> Number of cached prepared statements. 
+PreparedStatementsEvicted Counter Number of prepared statements evicted from the prepared statement cache. +PreparedStatementsExecuted Counter Number of prepared statements executed. +RegularStatementsExecuted Counter Number of **non** prepared statements executed. +PreparedStatementsRatio Gauge<Double> Percentage of statements that are prepared vs unprepared. +========================== ============== =========== + + +DroppedMessage Metrics +^^^^^^^^^^^^^^^^^^^^^^ + +Metrics specific to tracking dropped messages for different types of requests. +Dropped writes are stored and retried by ``Hinted Handoff``. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.DroppedMessages.{{MetricName}}.{{Type}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=DroppedMetrics scope={{Type}} name={{MetricName}}`` + +========================== ============== =========== +Name Type Description +========================== ============== =========== +CrossNodeDroppedLatency Timer The dropped latency across nodes. +InternalDroppedLatency Timer The dropped latency within node. +Dropped Meter Number of dropped messages. +========================== ============== =========== + +The different types of messages tracked are: + +============================ =========== +Name Description +============================ =========== +BATCH_STORE Batchlog write +BATCH_REMOVE Batchlog cleanup (after successfully applied) +COUNTER_MUTATION Counter writes +HINT Hint replay +MUTATION Regular writes +READ Regular reads +READ_REPAIR Read repair +PAGED_SLICE Paged read +RANGE_SLICE Token range read +REQUEST_RESPONSE RPC Callbacks +_TRACE Tracing writes +============================ =========== + +Streaming Metrics +^^^^^^^^^^^^^^^^^ + +Metrics reported during ``Streaming`` operations, such as repair, bootstrap, and rebuild. + +These metrics are specific to a peer endpoint, with the source node being the node you are pulling the metrics from. 
+ +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.Streaming.{{MetricName}}.{{PeerIP}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=Streaming scope={{PeerIP}} name={{MetricName}}`` + +========================== ============== =========== +Name Type Description +========================== ============== =========== +IncomingBytes Counter Number of bytes streamed to this node from the peer. +OutgoingBytes Counter Number of bytes streamed to the peer endpoint from this node. +========================== ============== =========== + + +Compaction Metrics +^^^^^^^^^^^^^^^^^^ + +Metrics specific to ``Compaction`` work. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.Compaction.{{MetricName}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=Compaction name={{MetricName}}`` + +========================== ======================================== =============================================== +Name Type Description +========================== ======================================== =============================================== +BytesCompacted Counter Total number of bytes compacted since server [re]start. +PendingTasks Gauge<Integer> Estimated number of compactions remaining to perform. +CompletedTasks Gauge<Long> Number of completed compactions since server [re]start. +TotalCompactionsCompleted Meter Throughput of completed compactions since server [re]start. +PendingTasksByTableName Gauge<Map<String, Map<String, Integer>>> Estimated number of compactions remaining to perform, grouped by keyspace and then table name. This info is also kept in ``Table Metrics``. 
+========================== ======================================== =============================================== + +CommitLog Metrics +^^^^^^^^^^^^^^^^^ + +Metrics specific to the ``CommitLog``. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.CommitLog.{{MetricName}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=CommitLog name={{MetricName}}`` + +========================== ============== =========== +Name Type Description +========================== ============== =========== +CompletedTasks Gauge<Long> Total number of commit log messages written since [re]start. +PendingTasks Gauge<Long> Number of commit log messages written but yet to be fsync'd. +TotalCommitLogSize Gauge<Long> Current size, in bytes, used by all the commit log segments. +WaitingOnSegmentAllocation Timer Time spent waiting for a CommitLogSegment to be allocated - under normal conditions this should be zero. +WaitingOnCommit Timer The time spent waiting on CL fsync; for Periodic this only occurs when the sync is lagging its sync interval. +========================== ============== =========== + +Storage Metrics +^^^^^^^^^^^^^^^ + +Metrics specific to the storage engine. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.Storage.{{MetricName}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=Storage name={{MetricName}}`` + +========================== ============== =========== +Name Type Description +========================== ============== =========== +Exceptions Counter Number of internal exceptions caught. Under normal conditions this should be zero. +Load Counter Size, in bytes, of the on disk data size this node manages. +TotalHints Counter Number of hint messages written to this node since [re]start. Includes one entry for each host to be hinted per hint. +TotalHintsInProgress Counter Number of hints attempting to be sent currently. 
+========================== ============== =========== + +HintedHandoff Metrics +^^^^^^^^^^^^^^^^^^^^^ + +Metrics specific to Hinted Handoff. There are also some metrics related to hints tracked in ``Storage Metrics``. + +These metrics include the peer endpoint **in the metric name**. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.HintedHandOffManager.{{MetricName}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=HintedHandOffManager name={{MetricName}}`` + +=========================== ============== =========== +Name Type Description +=========================== ============== =========== +Hints_created-{{PeerIP}} Counter Number of hints on disk for this peer. +Hints_not_stored-{{PeerIP}} Counter Number of hints not stored for this peer, due to being down past the configured hint window. +=========================== ============== =========== + +SSTable Index Metrics +^^^^^^^^^^^^^^^^^^^^^ + +Metrics specific to the SSTable index metadata. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.Index.{{MetricName}}.RowIndexEntry`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=Index scope=RowIndexEntry name={{MetricName}}`` + +=========================== ============== =========== +Name Type Description +=========================== ============== =========== +IndexedEntrySize Histogram Histogram of the on-heap size, in bytes, of the index across all SSTables. +IndexInfoCount Histogram Histogram of the number of on-heap index entries managed across all SSTables. +IndexInfoGets Histogram Histogram of the number of index seeks performed per SSTable. +=========================== ============== =========== + +BufferPool Metrics +^^^^^^^^^^^^^^^^^^ + +Metrics specific to the internal recycled buffer pool Cassandra manages. This pool is meant to keep allocations and GC +lower by recycling on and off heap buffers. 
+ +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.BufferPool.{{MetricName}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=BufferPool name={{MetricName}}`` + +=========================== ============== =========== +Name Type Description +=========================== ============== =========== +Size Gauge<Long> Size, in bytes, of the managed buffer pool. +Misses Meter The rate of misses in the pool. The higher this is the more allocations incurred. +=========================== ============== =========== + + +Client Metrics +^^^^^^^^^^^^^^ + +Metrics specific to client management. + +Reported name format: + +**Metric Name** + ``org.apache.cassandra.metrics.Client.{{MetricName}}`` + +**JMX MBean** + ``org.apache.cassandra.metrics:type=Client name={{MetricName}}`` + +=========================== ============== =========== +Name Type Description +=========================== ============== =========== +connectedNativeClients Counter Number of clients connected to this node's native protocol server. +connectedThriftClients Counter Number of clients connected to this node's thrift protocol server. +=========================== ============== =========== + +JMX +^^^ + +Any JMX-based client can access metrics from Cassandra. + +If you wish to access JMX metrics over HTTP, it is possible to download `Mx4jTool <http://mx4j.sourceforge.net/>`__ and +place ``mx4j-tools.jar`` into the classpath. 
 On startup you will see in the log:: + + HttpAdaptor version 3.0.2 started on port 8081 + +To choose a different port (8081 is the default) or a different listen address (the default is not 0.0.0.0), edit +``conf/cassandra-env.sh`` and uncomment:: + + #MX4J_ADDRESS="-Dmx4jaddress=0.0.0.0" + + #MX4J_PORT="-Dmx4jport=8081" + + +Metric Reporters +^^^^^^^^^^^^^^^^ + +As mentioned at the top of this section on monitoring, Cassandra metrics can be exported to a number of monitoring +systems using a number of `built in <http://metrics.dropwizard.io/3.1.0/getting-started/#other-reporting>`__ and `third party +<http://metrics.dropwizard.io/3.1.0/manual/third-party/>`__ reporter plugins. + +The configuration of these plugins is managed by the `metrics reporter config project +<https://github.com/addthis/metrics-reporter-config>`__. There is a sample configuration file located at +``conf/metrics-reporter-config-sample.yaml``. + +Once configured, you simply start Cassandra with the flag +``-Dcassandra.metricsReporterConfigFile=metrics-reporter-config.yaml``. The specified .yaml file plus any 3rd party +reporter jars must all be in Cassandra's classpath.
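As a rough illustration of how a monitoring script might consume the ``Latency`` type described under Metric Types, the sketch below derives the mean latency over a polling interval from two successive readings. It assumes the Timer's operation count and the matching ``TotalLatency`` counter have already been fetched (for example over JMX); ``LatencySample``, ``mbean_name`` and ``interval_mean_latency_micros`` are hypothetical helper names, not part of Cassandra. The helper that builds the MBean name joins the key properties with commas, as standard JMX ObjectName syntax requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySample:
    """One poll of a Latency-type metric (illustrative container, not a Cassandra API)."""
    count: int                 # operations recorded so far by the Timer
    total_latency_micros: int  # value of the matching TotalLatency counter


def mbean_name(metric_type: str, name: str, **scope: str) -> str:
    """Build a JMX ObjectName string with comma-separated key properties."""
    props = [f"type={metric_type}"]
    props += [f"{key}={value}" for key, value in scope.items()]
    props.append(f"name={name}")
    return "org.apache.cassandra.metrics:" + ",".join(props)


def interval_mean_latency_micros(prev: LatencySample, curr: LatencySample) -> float:
    """Mean latency (in microseconds) of operations completed between two polls."""
    ops = curr.count - prev.count
    if ops <= 0:
        # No new operations, or the counters were reset by a restart.
        return 0.0
    return (curr.total_latency_micros - prev.total_latency_micros) / ops


if __name__ == "__main__":
    # Keyspace and table names here are made up for the example.
    print(mbean_name("Table", "ReadLatency", keyspace="ks1", scope="users"))
    prev = LatencySample(count=1_000, total_latency_micros=250_000)
    curr = LatencySample(count=1_500, total_latency_micros=400_000)
    print(interval_mean_latency_micros(prev, curr))
```

Working from deltas rather than the all-time values keeps the result sensitive to recent behaviour; the same pattern applies to any ``Counter`` that is meant to be read as a change since the last poll.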
http://git-wip-us.apache.org/repos/asf/cassandra/blob/54f7335c/doc/source/operating/read_repair.rst ---------------------------------------------------------------------- diff --git a/doc/source/operating/read_repair.rst b/doc/source/operating/read_repair.rst new file mode 100644 index 0000000..0e52bf5 --- /dev/null +++ b/doc/source/operating/read_repair.rst @@ -0,0 +1,22 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +.. highlight:: none + +Read repair +----------- + +.. todo:: todo http://git-wip-us.apache.org/repos/asf/cassandra/blob/54f7335c/doc/source/operating/repair.rst ---------------------------------------------------------------------- diff --git a/doc/source/operating/repair.rst b/doc/source/operating/repair.rst new file mode 100644 index 0000000..97d8ce8 --- /dev/null +++ b/doc/source/operating/repair.rst @@ -0,0 +1,22 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. 
You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +.. highlight:: none + +Repair +------ + +.. todo:: todo http://git-wip-us.apache.org/repos/asf/cassandra/blob/54f7335c/doc/source/operating/security.rst ---------------------------------------------------------------------- diff --git a/doc/source/operating/security.rst b/doc/source/operating/security.rst new file mode 100644 index 0000000..80a33f4 --- /dev/null +++ b/doc/source/operating/security.rst @@ -0,0 +1,410 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +.. 
highlight:: none + +Security +-------- + +There are three main components to the security features provided by Cassandra: + +- TLS/SSL encryption for client and inter-node communication +- Client authentication +- Authorization + +TLS/SSL Encryption +^^^^^^^^^^^^^^^^^^ +Cassandra provides secure communication between a client machine and a database cluster and between nodes within a +cluster. Enabling encryption ensures that data in flight is not compromised and is transferred securely. The options for +client-to-node and node-to-node encryption are managed separately and may be configured independently. + +In both cases, the JVM defaults for supported protocols and cipher suites are used when encryption is enabled. These can +be overridden using the settings in ``cassandra.yaml``, but this is not recommended unless there are policies in place +which dictate certain settings or a need to disable vulnerable ciphers or protocols in cases where the JVM cannot be +updated. + +FIPS compliant settings can be configured at the JVM level and should not involve changing encryption settings in +``cassandra.yaml``. See `the Java document on FIPS <https://docs.oracle.com/javase/8/docs/technotes/guides/security/jsse/FIPS.html>`__ +for more details. + +For information on generating the keystore and truststore files used in SSL communications, see the +`Java documentation on creating keystores <http://download.oracle.com/javase/6/docs/technotes/guides/security/jsse/JSSERefGuide.html#CreateKeystore>`__. + +Inter-node Encryption +~~~~~~~~~~~~~~~~~~~~~ + +The settings for managing inter-node encryption are found in ``cassandra.yaml`` in the ``server_encryption_options`` +section. To enable inter-node encryption, change the ``internode_encryption`` setting from its default value of ``none`` +to one of: ``rack``, ``dc`` or ``all``.
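
As a rough illustration of the section described above, a ``server_encryption_options`` block that encrypts all
inter-node traffic might look like the sketch below. The keystore and truststore paths and passwords are placeholders
chosen for this example, not recommended values, and the exact set of keys may differ between Cassandra versions, so
check the ``cassandra.yaml`` shipped with your release.

::

    # Illustrative sketch only; paths and passwords are placeholders.
    server_encryption_options:
        # one of: none, rack, dc, all
        internode_encryption: all
        keystore: conf/.keystore
        keystore_password: changeit
        truststore: conf/.truststore
        truststore_password: changeit

Choosing ``dc`` or ``rack`` instead of ``all`` limits encryption to traffic that crosses datacenter or rack boundaries
respectively.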
+ +Client to Node Encryption +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The settings for managing client to node encryption are found in ``cassandra.yaml`` in the ``client_encryption_options`` +section. There are two primary toggles here for enabling encryption, ``enabled`` and ``optional``. + +- If neither is set to ``true``, client connections are entirely unencrypted. +- If ``enabled`` is set to ``true`` and ``optional`` is set to ``false``, all client connections must be secured. +- If both options are set to ``true``, both encrypted and unencrypted connections are supported using the same port. + Client connections using encryption with this configuration will be automatically detected and handled by the server. + +As an alternative to the ``optional`` setting, separate ports can also be configured for secure and unsecure connections +where operational requirements demand it. To do so, set ``optional`` to false and use the ``native_transport_port_ssl`` +setting in ``cassandra.yaml`` to specify the port to be used for secure client communication. + +.. _operation-roles: + +Roles +^^^^^ + +Cassandra uses database roles, which may represent either a single user or a group of users, in both authentication and +permissions management. Role management is an extension point in Cassandra and may be configured using the +``role_manager`` setting in ``cassandra.yaml``. The default setting uses ``CassandraRoleManager``, an implementation +which stores role information in the tables of the ``system_auth`` keyspace. + +See also the :ref:`CQL documentation on roles <roles>`. + +Authentication +^^^^^^^^^^^^^^ + +Authentication is pluggable in Cassandra and is configured using the ``authenticator`` setting in ``cassandra.yaml``. +Cassandra ships with two options included in the default distribution. + +By default, Cassandra is configured with ``AllowAllAuthenticator`` which performs no authentication checks and therefore +requires no credentials. 
It is used to disable authentication completely. Note that authentication is a necessary +condition of Cassandra's permissions subsystem, so if authentication is disabled, effectively so are permissions. + +The default distribution also includes ``PasswordAuthenticator``, which stores encrypted credentials in a system table. +This can be used to enable simple username/password authentication. + +.. _password-authentication: + +Enabling Password Authentication +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Before enabling client authentication on the cluster, client applications should be pre-configured with their intended +credentials. When a connection is initiated, the server will only ask for credentials once authentication is +enabled, so setting up the client side config in advance is safe. In contrast, as soon as a server has authentication +enabled, any connection attempt without proper credentials will be rejected, which may cause availability problems for +client applications. Once clients are set up and ready for authentication to be enabled, follow this procedure to enable +it on the cluster. + +Pick a single node in the cluster on which to perform the initial configuration. Ideally, no clients should connect +to this node during the setup process, so you may want to remove it from client config, block it at the network level +or possibly add a new temporary node to the cluster for this purpose. On that node, perform the following steps: + +1. Open a ``cqlsh`` session and change the replication factor of the ``system_auth`` keyspace. By default, this keyspace + uses ``SimpleStrategy`` and a ``replication_factor`` of 1. It is recommended to change this for any + non-trivial deployment to ensure that should nodes become unavailable, login is still possible. Best practice is to + configure a replication factor of 3 to 5 per DC. + +:: + + ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}; + +2. 
Edit ``cassandra.yaml`` to change the ``authenticator`` option like so: + +:: + + authenticator: PasswordAuthenticator + +3. Restart the node. + +4. Open a new ``cqlsh`` session using the credentials of the default superuser: + +:: + + cqlsh -u cassandra -p cassandra + +5. During login, the credentials for the default superuser are read with a consistency level of ``QUORUM``, whereas + those for all other users (including superusers) are read at ``LOCAL_ONE``. In the interests of performance and + availability, as well as security, operators should create another superuser and disable the default one. This step + is optional, but highly recommended. While logged in as the default superuser, create another superuser role which + can be used to bootstrap further configuration. + +:: + + -- create a new superuser + CREATE ROLE dba WITH SUPERUSER = true AND LOGIN = true AND PASSWORD = 'super'; + +6. Start a new ``cqlsh`` session, this time logging in as the new superuser, and disable the default superuser. + +:: + + ALTER ROLE cassandra WITH SUPERUSER = false AND LOGIN = false; + +7. Finally, set up the roles and credentials for your application users with :ref:`CREATE ROLE <create-role-statement>` + statements. + +At the end of these steps, the one node is configured to use password authentication. To roll that out across the +cluster, repeat steps 2 and 3 on each node in the cluster. Once all nodes have been restarted, authentication will be +fully enabled throughout the cluster. + +Note that using ``PasswordAuthenticator`` also requires the use of :ref:`CassandraRoleManager <operation-roles>`.
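
Put together, the two ``cassandra.yaml`` settings involved in the procedure above would read roughly as follows. This
is a minimal sketch that assumes the short class names are accepted, as in the ``authenticator`` example in step 2; it
shows only the two relevant lines, not a complete configuration file.

::

    # Sketch: password authentication together with the role manager it requires.
    authenticator: PasswordAuthenticator
    role_manager: CassandraRoleManager

Since ``CassandraRoleManager`` is described earlier as the default ``role_manager``, the second line only needs
attention if it was previously changed.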
 + +See also: :ref:`setting-credentials-for-internal-authentication`, :ref:`CREATE ROLE <create-role-statement>`, +:ref:`ALTER ROLE <alter-role-statement>`, :ref:`ALTER KEYSPACE <calter-keyspace-statement>` and :ref:`GRANT PERMISSION +<create-permission-statement>`. + +Authorization +^^^^^^^^^^^^^ + +Authorization is pluggable in Cassandra and is configured using the ``authorizer`` setting in ``cassandra.yaml``. +Cassandra ships with two options included in the default distribution. + +By default, Cassandra is configured with ``AllowAllAuthorizer``, which performs no checking and so effectively grants all +permissions to all roles. This must be used if ``AllowAllAuthenticator`` is the configured authenticator. + +The default distribution also includes ``CassandraAuthorizer``, which does implement full permissions management +functionality and stores its data in Cassandra system tables. + +Enabling Internal Authorization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Permissions are modelled as a whitelist, with the default assumption that a given role has no access to any database +resources. The implication of this is that once authorization is enabled on a node, all requests will be rejected until +the required permissions have been granted. For this reason, it is strongly recommended to perform the initial setup on +a node which is not processing client requests. + +The following assumes that authentication has already been enabled via the process outlined in +:ref:`password-authentication`. Perform these steps to enable internal authorization across the cluster: + +1. On the selected node, edit ``cassandra.yaml`` to change the ``authorizer`` option like so: + +:: + + authorizer: CassandraAuthorizer + +2. Restart the node. + +3. Open a new ``cqlsh`` session using the credentials of a role with superuser privileges: + +:: + + cqlsh -u dba -p super + +4. 
Configure the appropriate access privileges for your clients using `GRANT PERMISSION <cql.html#grant-permission>`_ + statements. On the other nodes, until configuration is updated and the node restarted, this will have no effect, so + disruption to clients is avoided. + +:: + + GRANT SELECT ON ks.t1 TO db_user; + +5. Once all the necessary permissions have been granted, repeat steps 1 and 2 for each node in turn. As each node + restarts and clients reconnect, the enforcement of the granted permissions will begin. + +See also: :ref:`GRANT PERMISSION <grant-permission-statement>`, :ref:`GRANT ALL <grant-all>` and :ref:`REVOKE PERMISSION +<revoke-permission-statement>`. + +Caching +^^^^^^^ + +Enabling authentication and authorization places additional load on the cluster by frequently reading from the +``system_auth`` tables. Furthermore, these reads are in the critical paths of many client operations, and so have the +potential to severely impact quality of service. To mitigate this, auth data such as credentials, permissions and role +details are cached for a configurable period. The caching can be configured (and even disabled) from ``cassandra.yaml`` +or using a JMX client. The JMX interface also supports invalidation of the various caches, but any changes made via JMX +are not persistent; the settings will be re-read from ``cassandra.yaml`` when the node is restarted. + +Each cache has 3 options which can be set: + +Validity Period + Controls the expiration of cache entries. After this period, entries are invalidated and removed from the cache. +Refresh Rate + Controls the rate at which background reads are performed to pick up any changes to the underlying data. While these + async refreshes are performed, caches will continue to serve (possibly) stale data. Typically, this will be set to a + shorter time than the validity period. +Max Entries + Controls the upper bound on cache size.
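
For example, the three options above might be set for the permissions cache in ``cassandra.yaml`` as sketched below.
The values are illustrative assumptions rather than recommendations; the refresh interval is simply chosen to be
shorter than the validity period, as the description above suggests.

::

    # Illustrative values only; tune for your own workload.
    permissions_validity_in_ms: 10000          # Validity Period
    permissions_update_interval_in_ms: 2000    # Refresh Rate
    permissions_cache_max_entries: 1000        # Max Entries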
+ +The naming for these options in ``cassandra.yaml`` follows the convention: + +* ``<type>_validity_in_ms`` +* ``<type>_update_interval_in_ms`` +* ``<type>_cache_max_entries`` + +Where ``<type>`` is one of ``credentials``, ``permissions``, or ``roles``. + +As mentioned, these are also exposed via JMX in the mbeans under the ``org.apache.cassandra.auth`` domain. + +JMX access +^^^^^^^^^^ + +Access control for JMX clients is configured separately to that for CQL. For both authentication and authorization, two +providers are available; the first based on standard JMX security and the second which integrates more closely with +Cassandra's own auth subsystem. + +The default settings for Cassandra make JMX accessible only from localhost. To enable remote JMX connections, edit +``cassandra-env.sh`` (or ``cassandra-env.ps1`` on Windows) to change the ``LOCAL_JMX`` setting to ``yes``. Under the +standard configuration, when remote JMX connections are enabled, :ref:`standard JMX authentication <standard-jmx-auth>` +is also switched on. + +Note that by default, local-only connections are not subject to authentication, but this can be enabled. + +If enabling remote connections, it is recommended to also use :ref:`SSL <jmx-with-ssl>` connections. + +Finally, after enabling auth and/or SSL, ensure that tools which use JMX, such as :ref:`nodetool <nodetool>`, are +correctly configured and working as expected. + +.. _standard-jmx-auth: + +Standard JMX Auth +~~~~~~~~~~~~~~~~~ + +Users permitted to connect to the JMX server are specified in a simple text file. 
The location of this file is set in +``cassandra-env.sh`` by the line: + +:: + + JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password" + +Edit the password file to add username/password pairs: + +:: + + jmx_user jmx_password + +Secure the credentials file so that only the user running the Cassandra process can read it: + +:: + + $ chown cassandra:cassandra /etc/cassandra/jmxremote.password + $ chmod 400 /etc/cassandra/jmxremote.password + +Optionally, enable access control to limit the scope of what defined users can do via JMX. Note that this is a fairly +blunt instrument in this context as most operational tools in Cassandra require full read/write access. To configure a +simple access file, uncomment this line in ``cassandra-env.sh``: + +:: + + #JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.access.file=/etc/cassandra/jmxremote.access" + +Then edit the access file to grant your JMX user readwrite permission: + +:: + + jmx_user readwrite + +Cassandra must be restarted to pick up the new settings. + +See also: `Using File-Based Password Authentication In JMX +<http://docs.oracle.com/javase/7/docs/technotes/guides/management/agent.html#gdenv>`__ + + +Cassandra Integrated Auth +~~~~~~~~~~~~~~~~~~~~~~~~~ + +An alternative to the out-of-the-box JMX auth is to use Cassandra's own authentication and/or authorization providers +for JMX clients. This is potentially more flexible and secure, but it comes with one major caveat: it is not +available until *after* a node has joined the ring, because the auth subsystem is not fully configured until that point. +However, it is often critical for monitoring purposes to have JMX access, particularly during bootstrap. So it is +recommended, where possible, to use local-only JMX auth during bootstrap and then, if remote connectivity is required, +to switch to integrated auth once the node has joined the ring and initial setup is complete.
 + +With this option, the same database roles used for CQL authentication can be used to control access to JMX, so updates +can be managed centrally using just ``cqlsh``. Furthermore, fine-grained control over exactly which operations are +permitted on particular MBeans can be achieved via :ref:`GRANT PERMISSION <cgrant-permission-statement>`. + +To enable integrated authentication, edit ``cassandra-env.sh`` to uncomment these lines: + +:: + + #JVM_OPTS="$JVM_OPTS -Dcassandra.jmx.remote.login.config=CassandraLogin" + #JVM_OPTS="$JVM_OPTS -Djava.security.auth.login.config=$CASSANDRA_HOME/conf/cassandra-jaas.config" + +And disable the JMX standard auth by commenting out this line: + +:: + + JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password" + +To enable integrated authorization, uncomment this line: + +:: + + #JVM_OPTS="$JVM_OPTS -Dcassandra.jmx.authorizer=org.apache.cassandra.auth.jmx.AuthorizationProxy" + +Check standard access control is off by ensuring this line is commented out: + +:: + + #JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.access.file=/etc/cassandra/jmxremote.access" + +With integrated authentication and authorization enabled, operators can define specific roles and grant them access to +the particular JMX resources that they need.
For example, a role with the necessary permissions to use tools such as +jconsole or jmc in read-only mode would be defined as: + +:: + + CREATE ROLE jmx WITH LOGIN = false; + GRANT SELECT ON ALL MBEANS TO jmx; + GRANT DESCRIBE ON ALL MBEANS TO jmx; + GRANT EXECUTE ON MBEAN 'java.lang:type=Threading' TO jmx; + GRANT EXECUTE ON MBEAN 'com.sun.management:type=HotSpotDiagnostic' TO jmx; + + # Grant the jmx role to one with login permissions so that it can access the JMX tooling + CREATE ROLE ks_user WITH PASSWORD = 'password' AND LOGIN = true AND SUPERUSER = false; + GRANT jmx TO ks_user; + +Fine grained access control to individual MBeans is also supported: + +:: + + GRANT EXECUTE ON MBEAN 'org.apache.cassandra.db:type=Tables,keyspace=test_keyspace,table=t1' TO ks_user; + GRANT EXECUTE ON MBEAN 'org.apache.cassandra.db:type=Tables,keyspace=test_keyspace,table=*' TO ks_owner; + +This permits the ``ks_user`` role to invoke methods on the MBean representing a single table in ``test_keyspace``, while +granting the same permission for all table level MBeans in that keyspace to the ``ks_owner`` role. + +Adding/removing roles and granting/revoking of permissions is handled dynamically once the initial setup is complete, so +no further restarts are required if permissions are altered. + +See also: :ref:`Permissions <permissions>`. + +.. _jmx-with-ssl: + +JMX With SSL +~~~~~~~~~~~~ + +JMX SSL configuration is controlled by a number of system properties, some of which are optional. 
To turn on SSL, edit +the relevant lines in ``cassandra-env.sh`` (or ``cassandra-env.ps1`` on Windows) to uncomment and set the values of these +properties as required: + +``com.sun.management.jmxremote.ssl`` + set to true to enable SSL +``com.sun.management.jmxremote.ssl.need.client.auth`` + set to true to enable validation of client certificates +``com.sun.management.jmxremote.registry.ssl`` + enables SSL sockets for the RMI registry from which clients obtain the JMX connector stub +``com.sun.management.jmxremote.ssl.enabled.protocols`` + by default, the protocols supported by the JVM will be used, override with a comma-separated list. Note that this is + not usually necessary and using the defaults is the preferred option. +``com.sun.management.jmxremote.ssl.enabled.cipher.suites`` + by default, the cipher suites supported by the JVM will be used, override with a comma-separated list. Note that + this is not usually necessary and using the defaults is the preferred option. +``javax.net.ssl.keyStore`` + set the path on the local filesystem of the keystore containing server private keys and public certificates +``javax.net.ssl.keyStorePassword`` + set the password of the keystore file +``javax.net.ssl.trustStore`` + if validation of client certificates is required, use this property to specify the path of the truststore containing + the public certificates of trusted clients +``javax.net.ssl.trustStorePassword`` + set the password of the truststore file + +See also: `Oracle Java7 Docs <http://docs.oracle.com/javase/7/docs/technotes/guides/management/agent.html#gdemv>`__, +`Monitor Java with JMX <https://www.lullabot.com/articles/monitor-java-with-jmx>`__ http://git-wip-us.apache.org/repos/asf/cassandra/blob/54f7335c/doc/source/operating/snitch.rst ---------------------------------------------------------------------- diff --git a/doc/source/operating/snitch.rst b/doc/source/operating/snitch.rst new file mode 100644 index 0000000..faea0b3 --- /dev/null +++ 
b/doc/source/operating/snitch.rst @@ -0,0 +1,78 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +.. highlight:: none + +Snitch +------ + +In Cassandra, the snitch has two functions: + +- it teaches Cassandra enough about your network topology to route requests efficiently. +- it allows Cassandra to spread replicas around your cluster to avoid correlated failures. It does this by grouping + machines into "datacenters" and "racks." Cassandra will do its best not to have more than one replica on the same + "rack" (which may not actually be a physical location). + +Dynamic snitching +^^^^^^^^^^^^^^^^^ + +The dynamic snitch monitors read latencies to avoid reading from hosts that have slowed down. The dynamic snitch is +configured with the following properties in ``cassandra.yaml``: + +- ``dynamic_snitch``: whether the dynamic snitch should be enabled or disabled. +- ``dynamic_snitch_update_interval_in_ms``: controls how often to perform the more expensive part of host score + calculation. +- ``dynamic_snitch_reset_interval_in_ms``: if set greater than zero and ``read_repair_chance`` is < 1.0, this will allow + 'pinning' of replicas to hosts in order to increase cache capacity.
 +- ``dynamic_snitch_badness_threshold``: The badness threshold will control how much worse the pinned host has to be + before the dynamic snitch will prefer other replicas over it. This is expressed as a double which represents a + percentage. Thus, a value of 0.2 means Cassandra would continue to prefer the static snitch values until the pinned + host was 20% worse than the fastest. + +Snitch classes +^^^^^^^^^^^^^^ + +The ``endpoint_snitch`` parameter in ``cassandra.yaml`` should be set to the class that implements +``IEndpointSnitch``, which will be wrapped by the dynamic snitch and decides whether two endpoints are in the same data center +or on the same rack. Out of the box, Cassandra provides the following snitch implementations: + +GossipingPropertyFileSnitch + This should be your go-to snitch for production use. The rack and datacenter for the local node are defined in + ``cassandra-rackdc.properties`` and propagated to other nodes via gossip. If ``cassandra-topology.properties`` exists, + it is used as a fallback, allowing migration from the PropertyFileSnitch. + +SimpleSnitch + Treats Strategy order as proximity. This can improve cache locality when disabling read repair. Only appropriate for + single-datacenter deployments. + +PropertyFileSnitch + Proximity is determined by rack and data center, which are explicitly configured in + ``cassandra-topology.properties``. + +Ec2Snitch + Appropriate for EC2 deployments in a single Region. Loads Region and Availability Zone information from the EC2 API. + The Region is treated as the datacenter, and the Availability Zone as the rack. Only private IPs are used, so this + will not work across multiple regions. + +Ec2MultiRegionSnitch + Uses public IPs as broadcast_address to allow cross-region connectivity (thus, you should set seed addresses to the + public IP as well).
You will need to open the ``storage_port`` or ``ssl_storage_port`` on the public IP firewall + (For intra-Region traffic, Cassandra will switch to the private IP after establishing a connection). + +RackInferringSnitch + Proximity is determined by rack and data center, which are assumed to correspond to the 3rd and 2nd octet of each + node's IP address, respectively. Unless this happens to match your deployment conventions, this is best used as an + example of writing a custom Snitch class and is provided in that spirit. http://git-wip-us.apache.org/repos/asf/cassandra/blob/54f7335c/doc/source/operating/topo_changes.rst ---------------------------------------------------------------------- diff --git a/doc/source/operating/topo_changes.rst b/doc/source/operating/topo_changes.rst new file mode 100644 index 0000000..9d6a2ba --- /dev/null +++ b/doc/source/operating/topo_changes.rst @@ -0,0 +1,122 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +.. highlight:: none + +Adding, replacing, moving and removing nodes +-------------------------------------------- + +Bootstrap +^^^^^^^^^ + +Adding new nodes is called "bootstrapping". 
The ``num_tokens`` parameter will define the number of virtual nodes +(tokens) the joining node will be assigned during bootstrap. The tokens define the sections of the ring (token ranges) +the node will become responsible for. + +Token allocation +~~~~~~~~~~~~~~~~ + +With the default token allocation algorithm the new node will pick ``num_tokens`` random tokens to become responsible +for. Since tokens are distributed randomly, load distribution improves with a higher number of virtual nodes, but it +also increases token management overhead. The default of 256 virtual nodes should provide a reasonable load balance with +acceptable overhead. + +On 3.0+ a new token allocation algorithm was introduced to allocate tokens based on the load of existing virtual nodes +for a given keyspace, and thus yield an improved load distribution with a lower number of tokens. To use this approach, +the new node must be started with the JVM option ``-Dcassandra.allocate_tokens_for_keyspace=<keyspace>``, where +``<keyspace>`` is the keyspace from which the algorithm can find the load information to optimize token assignment for. + +Manual token assignment +""""""""""""""""""""""" + +You may specify a comma-separated list of tokens manually with the ``initial_token`` ``cassandra.yaml`` parameter, and +if that is specified Cassandra will skip the token allocation process. This may be useful when doing token assignment +with an external tool or when restoring a node with its previous tokens. + +Range streaming +~~~~~~~~~~~~~~~~ + +After the tokens are allocated, the joining node will pick current replicas of the token ranges it will become +responsible for to stream data from. By default it will stream from the primary replica of each token range in order to +guarantee data in the new node will be consistent with the current state. + +In the case of any unavailable replica, the consistent bootstrap process will fail.
To override this behavior and +potentially miss data from an unavailable replica, set the JVM flag ``-Dcassandra.consistent.rangemovement=false``. + +Resuming failed/hung bootstrap +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On 2.2+, if the bootstrap process fails, it's possible to resume bootstrap from the previous saved state by calling +``nodetool bootstrap resume``. If for some reason the bootstrap hangs or stalls, it may also be resumed by simply +restarting the node. In order to clean up bootstrap state and start fresh, you may set the JVM startup flag +``-Dcassandra.reset_bootstrap_progress=true``. + +On earlier versions, when the bootstrap process fails it is recommended to wipe the node (remove all the data) and restart +the bootstrap process. + +Manual bootstrapping +~~~~~~~~~~~~~~~~~~~~ + +It's possible to skip the bootstrapping process entirely and join the ring straight away by setting the hidden parameter +``auto_bootstrap: false``. This may be useful when restoring a node from a backup or creating a new data-center. + +Removing nodes +^^^^^^^^^^^^^^ + +You can take a node out of the cluster by running ``nodetool decommission`` on a live node, or ``nodetool removenode`` (on any +other machine) to remove a dead one. This will assign the ranges the old node was responsible for to other nodes, and +replicate the appropriate data there. If decommission is used, the data will stream from the decommissioned node. If +removenode is used, the data will stream from the remaining replicas. + +No data is removed automatically from the node being decommissioned, so if you want to put the node back into service at +a different token on the ring, the data should be removed manually. + +Moving nodes +^^^^^^^^^^^^ + +When ``num_tokens: 1`` it's possible to move the node position in the ring with ``nodetool move``. Moving is both more +convenient and more efficient than decommission + bootstrap.
After moving a node, ``nodetool cleanup`` should be +run to remove any unnecessary data. + +Replacing a dead node +^^^^^^^^^^^^^^^^^^^^^ + +In order to replace a dead node, start Cassandra with the JVM startup flag +``-Dcassandra.replace_address_first_boot=<dead_node_ip>``. Once this property is enabled the node starts in a hibernate +state, during which all the other nodes will see this node as down. + +The replacing node will now start to bootstrap the data from the rest of the nodes in the cluster. The main difference +from normal bootstrapping of a new node is that this new node will not accept any writes during this phase. + +Once the bootstrapping is complete the node will be marked "UP". Hinted handoff is then relied upon to make this node +consistent (since it has not accepted writes since the start of the bootstrap). + +.. Note:: If the replacement process takes longer than ``max_hint_window_in_ms`` you **MUST** run repair to make the + replaced node consistent again, since it missed ongoing writes during bootstrapping. + +Monitoring progress +^^^^^^^^^^^^^^^^^^^ + +Bootstrap, replace, move and remove progress can be monitored using ``nodetool netstats``, which will show the progress +of the streaming operations. + +Cleanup data after range movements +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +As a safety measure, Cassandra does not automatically remove data from nodes that "lose" part of their token range due +to a range movement operation (bootstrap, move, replace). Run ``nodetool cleanup`` on the nodes that lost ranges to the +joining node when you are satisfied the new node is up and working. If you do not do this, the old data will still be +counted against the load on that node.
