Repository: cassandra Updated Branches: refs/heads/trunk 3d66a81dc -> 57b87d21a
Add basic troubleshooting documentation This is just a start, but should help people get out of the most basic of reliability/performance issues. Includes information on isolating the node or nodes causing an issue, as well as detailed information on how to debug a correctness/reliability issue on a particular node. Patch by Joseph Lynch, reviewed by Jon Haddad for CASSANDRA-14553 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/57b87d21 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/57b87d21 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/57b87d21 Branch: refs/heads/trunk Commit: 57b87d21aa4712f63903a191cff6c832c759b97b Parents: 3d66a81 Author: Joseph Lynch <[email protected]> Authored: Sun Jul 1 17:56:08 2018 -0700 Committer: Jon Haddad <[email protected]> Committed: Thu Sep 20 09:16:17 2018 -0700 ---------------------------------------------------------------------- doc/source/operating/metrics.rst | 5 + doc/source/troubleshooting/finding_nodes.rst | 149 ++++++ doc/source/troubleshooting/index.rst | 21 +- doc/source/troubleshooting/reading_logs.rst | 267 +++++++++++ doc/source/troubleshooting/use_nodetool.rst | 245 ++++++++++ doc/source/troubleshooting/use_tools.rst | 542 ++++++++++++++++++++++ 6 files changed, 1228 insertions(+), 1 deletion(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/cassandra/blob/57b87d21/doc/source/operating/metrics.rst ---------------------------------------------------------------------- diff --git a/doc/source/operating/metrics.rst b/doc/source/operating/metrics.rst index 325395c..4f61bf2 100644 --- a/doc/source/operating/metrics.rst +++ b/doc/source/operating/metrics.rst @@ -16,6 +16,8 @@ .. highlight:: none +.. _monitoring-metrics: + Monitoring ---------- @@ -55,6 +57,8 @@ All metrics reported by cassandra fit into one of the following types. A meter metric which measures mean throughput and one-, five-, and fifteen-minute exponentially-weighted moving average throughputs. +.. _table-metrics: + Table Metrics ^^^^^^^^^^^^^ @@ -397,6 +401,7 @@ RegularStatementsExecuted Counter Number of **non** prepared statements PreparedStatementsRatio Gauge<Double> Percentage of statements that are prepared vs unprepared. ========================== ============== =========== +.. _dropped-metrics: DroppedMessage Metrics ^^^^^^^^^^^^^^^^^^^^^^ http://git-wip-us.apache.org/repos/asf/cassandra/blob/57b87d21/doc/source/troubleshooting/finding_nodes.rst ---------------------------------------------------------------------- diff --git a/doc/source/troubleshooting/finding_nodes.rst b/doc/source/troubleshooting/finding_nodes.rst new file mode 100644 index 0000000..df5e16c --- /dev/null +++ b/doc/source/troubleshooting/finding_nodes.rst @@ -0,0 +1,149 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. 
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +Find The Misbehaving Nodes +========================== + +The first step to troubleshooting a Cassandra issue is to use error messages, +metrics and monitoring information to identify whether the issue lies with the +clients or the server, and if it does lie with the server, to find the problematic +nodes in the Cassandra cluster. The goal is to determine if this is a systemic +issue (e.g. a query pattern that affects the entire cluster) or isolated to a +subset of nodes (e.g. neighbors holding a shared token range or even a single +node with bad hardware). + +There are many sources of information that help determine where the problem +lies. Some of the most common are mentioned below. + +Client Logs and Errors +---------------------- +Clients of the cluster often leave the best breadcrumbs to follow. Perhaps +client latencies or error rates have increased in a particular datacenter +(likely eliminating the other datacenters' nodes), or clients are receiving a +particular kind of error code indicating a particular kind of problem. +Troubleshooters can often rule out many failure modes just by reading the error +messages. In fact, many Cassandra error messages include the last coordinator +contacted to help operators find nodes to start with. + +Some common errors (likely culprit in parentheses) assuming the client has +similar error names to the Datastax :ref:`drivers <client-drivers>`: + +* ``SyntaxError`` (**client**): This and other ``QueryValidationException`` + errors indicate that the client sent a malformed request. These are rarely + server issues and usually indicate bad queries. +* ``UnavailableException`` (**server**): This means that the Cassandra + coordinator node has rejected the query as it believes that insufficient + replica nodes are available. If many coordinators are throwing this error it + likely means that there really are (typically) multiple nodes down in the + cluster and you can identify them using :ref:`nodetool status + <nodetool-status>`. If only a single coordinator is throwing this error it may + mean that node has been partitioned from the rest. +* ``OperationTimedOutException`` (**server**): This is the most frequent + timeout message raised when clients set timeouts, and means that the query + took longer than the *client side* timeout supplied by the client. The error + message will include the coordinator node that was last tried, which is + usually a good starting point. This error usually indicates either + aggressive client timeout values or slow server coordinators/replicas. +* ``ReadTimeoutException`` or ``WriteTimeoutException`` (**server**): These + are raised when clients do not specify lower timeouts and a *coordinator* + timeout occurs based on the values supplied in the ``cassandra.yaml`` + configuration file. They usually indicate a serious server side problem as + the default values are usually multiple seconds. + +Metrics +------- + +If you have Cassandra :ref:`metrics <monitoring-metrics>` reporting to a +centralized location such as `Graphite <https://graphiteapp.org/>`_ or +`Grafana <https://grafana.com/>`_ you can typically use those to narrow down +the problem. At this stage narrowing down the issue to a particular +datacenter, rack, or even group of nodes is the main goal.
Some helpful metrics +to look at are: + +Errors +^^^^^^ +Cassandra refers to internode messaging errors as "drops", and provides a +number of :ref:`Dropped Message Metrics <dropped-metrics>` to help narrow +down errors. If particular nodes are actively dropping messages, they are +likely related to the issue. + +Latency +^^^^^^^ +For timeouts or latency related issues you can start with :ref:`Table +Metrics <table-metrics>` by comparing Coordinator level metrics e.g. +``CoordinatorReadLatency`` or ``CoordinatorWriteLatency`` with their associated +replica metrics e.g. ``ReadLatency`` or ``WriteLatency``. Issues usually show +up on the ``99th`` percentile before they show up on the ``50th`` percentile or +the ``mean``. While ``maximum`` coordinator latencies are not typically very +helpful due to the exponentially decaying reservoir used internally to produce +metrics, ``maximum`` replica latencies that correlate with increased ``99th`` +percentiles on coordinators can help narrow down the problem. + +There are usually three main possibilities: + +1. Coordinator latencies are high on all nodes, but only a few nodes' local + read latencies are high. This points to slow replica nodes; the + coordinators are just reflecting the side-effects. This usually happens + when clients are not token aware. +2. Coordinator latencies and replica latencies increase at the + same time on a few nodes. If clients are token aware this is almost + always what happens and points to slow replicas of a subset of token + ranges (only part of the ring). +3. Coordinator and local latencies are high on many nodes. This usually + indicates either a tipping point in the cluster capacity (too many writes or + reads per second), or a new query pattern. + +It's important to remember that depending on the client's load balancing +behavior and consistency levels, coordinator and replica metrics may or may +not correlate. In particular if you use ``TokenAware`` policies the same +node's coordinator and replica latencies will often increase together, but if +you just use normal ``DCAwareRoundRobin`` coordinator latencies can increase +with unrelated replica nodes' latencies. For example: + +* ``TokenAware`` + ``LOCAL_ONE``: should always have coordinator and replica + latencies on the same node rise together +* ``TokenAware`` + ``LOCAL_QUORUM``: should always have coordinator and + multiple replica latencies rise together in the same datacenter. +* ``TokenAware`` + ``QUORUM``: replica latencies in other datacenters can + affect coordinator latencies. +* ``DCAwareRoundRobin`` + ``LOCAL_ONE``: coordinator latencies and unrelated + replica nodes' latencies will rise together. +* ``DCAwareRoundRobin`` + ``LOCAL_QUORUM``: different coordinator and replica + latencies will rise together with little correlation. + +Query Rates +^^^^^^^^^^^ +Sometimes the :ref:`Table <table-metrics>` query rate metrics can help +narrow down load issues, as a "small" increase in coordinator queries per second +(QPS) may correlate with a very large increase in replica level QPS. This most +often happens with ``BATCH`` writes, where a client may send a single ``BATCH`` +query that might contain 50 statements in it, which if you have 9 copies (RF=3, +three datacenters) means that every coordinator ``BATCH`` write turns into 450 +replica writes! This is why keeping ``BATCH`` writes to the same partition is so +critical, otherwise you can exhaust significant CPU capacity with a "single" +query.
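If you do not have centralized metrics, you can still compare nodes directly
from the command line. The following is a minimal sketch (the hostnames are
placeholders and it assumes SSH access plus ``nodetool`` on every node) that
pulls the coordinator latency percentiles from a few nodes so an outlier
stands out::

    $ # Hypothetical node names; build your own list, e.g. from `nodetool status`
    $ for host in cass-dc1-r1-1 cass-dc1-r2-1 cass-dc1-r3-1; do
          echo "== ${host}" ;
          ssh "${host}" nodetool proxyhistograms | grep -E 'Percentile|99%' ;
      done

A node whose ``99%`` read or write latency is far above its peers is a good
candidate for the deeper single-node debugging described next.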
+ + +Next Step: Investigate the Node(s) +---------------------------------- + +Once you have narrowed down the problem as much as possible (datacenter, rack, +node), log in to one of the nodes using SSH and proceed to debug using +:ref:`logs <reading-logs>`, :ref:`nodetool <use-nodetool>`, and +:ref:`os tools <use-os-tools>`. If you are not able to log in you may still +have access to :ref:`logs <reading-logs>` and :ref:`nodetool <use-nodetool>` +remotely. http://git-wip-us.apache.org/repos/asf/cassandra/blob/57b87d21/doc/source/troubleshooting/index.rst ---------------------------------------------------------------------- diff --git a/doc/source/troubleshooting/index.rst b/doc/source/troubleshooting/index.rst index 2e5cf10..79b46d6 100644 --- a/doc/source/troubleshooting/index.rst +++ b/doc/source/troubleshooting/index.rst @@ -17,4 +17,23 @@ Troubleshooting =============== -.. TODO: todo +Like any distributed database, Cassandra sometimes breaks and you will have +to troubleshoot what is going on. Generally speaking you can debug Cassandra +like any other distributed Java program, meaning that you have to find which +machines in your cluster are misbehaving and then isolate the problem using +logs and tools. Luckily Cassandra has a great set of introspection tools to +help you. + +These pages include a number of command examples demonstrating various +debugging and analysis techniques, mostly for Linux/Unix systems. If you don't +have access to the machines running Cassandra, or are running on Windows or +another operating system, you may not be able to use the exact commands but +there are likely equivalent tools you can use. + +.. toctree:: + :maxdepth: 2 + + finding_nodes + reading_logs + use_nodetool + use_tools http://git-wip-us.apache.org/repos/asf/cassandra/blob/57b87d21/doc/source/troubleshooting/reading_logs.rst ---------------------------------------------------------------------- diff --git a/doc/source/troubleshooting/reading_logs.rst b/doc/source/troubleshooting/reading_logs.rst new file mode 100644 index 0000000..08f7d4d --- /dev/null +++ b/doc/source/troubleshooting/reading_logs.rst @@ -0,0 +1,267 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +.. _reading-logs: + +Cassandra Logs +============== +Cassandra has rich support for logging and attempts to give operators maximum +insight into the database while at the same time limiting noise to the logs. + +Common Log Files +---------------- +Cassandra has three main logs, the ``system.log``, ``debug.log`` and +``gc.log``, which hold general logging messages, debugging logging messages, and +Java garbage collection logs respectively. + +These logs by default live in ``${CASSANDRA_HOME}/logs``, but most Linux +distributions relocate logs to ``/var/log/cassandra``.
Operators can tune +this location as well as what levels are logged using the provided +``logback.xml`` file. + +``system.log`` +^^^^^^^^^^^^^^ +This log is the default Cassandra log and is a good place to start any +investigation. Some examples of activities logged to this log: + +* Uncaught exceptions. These can be very useful for debugging errors. +* ``GCInspector`` messages indicating long garbage collector pauses. When long + pauses happen Cassandra will print how long and also what was the state of + the system (thread state) at the time of that pause. This can help narrow + down a capacity issue (either not enough heap or not enough spare CPU). +* Information about nodes joining and leaving the cluster as well as token + metadata (data ownersip) changes. This is useful for debugging network + partitions, data movements, and more. +* Keyspace/Table creation, modification, deletion. +* ``StartupChecks`` that ensure optimal configuration of the operating system + to run Cassandra +* Information about some background operational tasks (e.g. Index + Redistribution). + +As with any application, looking for ``ERROR`` or ``WARN`` lines can be a +great first step:: + + $ # Search for warnings or errors in the latest system.log + $ grep 'WARN\|ERROR' system.log | tail + ... + + $ # Search for warnings or errors in all rotated system.log + $ zgrep 'WARN\|ERROR' system.log.* | less + ... + +``debug.log`` +^^^^^^^^^^^^^^ +This log contains additional debugging information that may be useful when +troubleshooting but may be much noiser than the normal ``system.log``. Some +examples of activities logged to this log: + +* Information about compactions, including when they start, which sstables + they contain, and when they finish. +* Information about memtable flushes to disk, including when they happened, + how large the flushes were, and which commitlog segments the flush impacted. + +This log can be *very* noisy, so it is highly recommended to use ``grep`` and +other log analysis tools to dive deep. For example:: + + $ # Search for messages involving a CompactionTask with 5 lines of context + $ grep CompactionTask debug.log -C 5 + ... + + $ # Look at the distribution of flush tasks per keyspace + $ grep "Enqueuing flush" debug.log | cut -f 10 -d ' ' | sort | uniq -c + 6 compaction_history: + 1 test_keyspace: + 2 local: + 17 size_estimates: + 17 sstable_activity: + + +``gc.log`` +^^^^^^^^^^^^^^ +The gc log is a standard Java GC log. With the default ``jvm.options`` +settings you get a lot of valuable information in this log such as +application pause times, and why pauses happened. This may help narrow +down throughput or latency issues to a mistuned JVM. 
For example you can +view the last few pauses:: + + $ grep stopped gc.log.0.current | tail + 2018-08-29T00:19:39.522+0000: 3022663.591: Total time for which application threads were stopped: 0.0332813 seconds, Stopping threads took: 0.0008189 seconds + 2018-08-29T00:19:44.369+0000: 3022668.438: Total time for which application threads were stopped: 0.0312507 seconds, Stopping threads took: 0.0007025 seconds + 2018-08-29T00:19:49.796+0000: 3022673.865: Total time for which application threads were stopped: 0.0307071 seconds, Stopping threads took: 0.0006662 seconds + 2018-08-29T00:19:55.452+0000: 3022679.521: Total time for which application threads were stopped: 0.0309578 seconds, Stopping threads took: 0.0006832 seconds + 2018-08-29T00:20:00.127+0000: 3022684.197: Total time for which application threads were stopped: 0.0310082 seconds, Stopping threads took: 0.0007090 seconds + 2018-08-29T00:20:06.583+0000: 3022690.653: Total time for which application threads were stopped: 0.0317346 seconds, Stopping threads took: 0.0007106 seconds + 2018-08-29T00:20:10.079+0000: 3022694.148: Total time for which application threads were stopped: 0.0299036 seconds, Stopping threads took: 0.0006889 seconds + 2018-08-29T00:20:15.739+0000: 3022699.809: Total time for which application threads were stopped: 0.0078283 seconds, Stopping threads took: 0.0006012 seconds + 2018-08-29T00:20:15.770+0000: 3022699.839: Total time for which application threads were stopped: 0.0301285 seconds, Stopping threads took: 0.0003789 seconds + 2018-08-29T00:20:15.798+0000: 3022699.867: Total time for which application threads were stopped: 0.0279407 seconds, Stopping threads took: 0.0003627 seconds + + +This shows a lot of valuable information including how long the application +was paused (meaning zero user queries were being serviced during the e.g. 33ms +JVM pause) as well as how long it took to enter the safepoint. You can use this +raw data to e.g. 
get the longest pauses:: + + $ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | tail | xargs -IX grep X gc.log.0.current | sort -k 1 + 2018-08-28T17:13:40.520-0700: 1.193: Total time for which application threads were stopped: 0.0157914 seconds, Stopping threads took: 0.0000355 seconds + 2018-08-28T17:13:41.206-0700: 1.879: Total time for which application threads were stopped: 0.0249811 seconds, Stopping threads took: 0.0000318 seconds + 2018-08-28T17:13:41.638-0700: 2.311: Total time for which application threads were stopped: 0.0561130 seconds, Stopping threads took: 0.0000328 seconds + 2018-08-28T17:13:41.677-0700: 2.350: Total time for which application threads were stopped: 0.0362129 seconds, Stopping threads took: 0.0000597 seconds + 2018-08-28T17:13:41.781-0700: 2.454: Total time for which application threads were stopped: 0.0442846 seconds, Stopping threads took: 0.0000238 seconds + 2018-08-28T17:13:41.976-0700: 2.649: Total time for which application threads were stopped: 0.0377115 seconds, Stopping threads took: 0.0000250 seconds + 2018-08-28T17:13:42.172-0700: 2.845: Total time for which application threads were stopped: 0.0475415 seconds, Stopping threads took: 0.0001018 seconds + 2018-08-28T17:13:42.825-0700: 3.498: Total time for which application threads were stopped: 0.0379155 seconds, Stopping threads took: 0.0000571 seconds + 2018-08-28T17:13:43.574-0700: 4.247: Total time for which application threads were stopped: 0.0323812 seconds, Stopping threads took: 0.0000574 seconds + 2018-08-28T17:13:44.602-0700: 5.275: Total time for which application threads were stopped: 0.0238975 seconds, Stopping threads took: 0.0000788 seconds + +In this case any client waiting on a query would have experienced a `56ms` +latency at 17:13:41. + +Note that GC pauses are not _only_ garbage collection, although +generally speaking high pauses with fast safepoints indicate a lack of JVM heap +or mistuned JVM GC algorithm. High pauses with slow safepoints typically +indicate that the JVM is having trouble entering a safepoint which usually +indicates slow disk drives (Cassandra makes heavy use of memory mapped reads +which the JVM doesn't know could have disk latency, so the JVM safepoint logic +doesn't handle a blocking memory mapped read particularly well). + +Using these logs you can even get a pause distribution with something like +`histogram.py <https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py>`_:: + + $ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | histogram.py + # NumSamples = 410293; Min = 0.00; Max = 11.49 + # Mean = 0.035346; Variance = 0.002216; SD = 0.047078; Median 0.036498 + # each â represents a count of 5470 + 0.0001 - 1.1496 [410255]: âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ + 1.1496 - 2.2991 [ 15]: + 2.2991 - 3.4486 [ 5]: + 3.4486 - 4.5981 [ 1]: + 4.5981 - 5.7475 [ 5]: + 5.7475 - 6.8970 [ 9]: + 6.8970 - 8.0465 [ 1]: + 8.0465 - 9.1960 [ 0]: + 9.1960 - 10.3455 [ 0]: + 10.3455 - 11.4949 [ 2]: + +We can see in this case while we have very good average performance something +is causing multi second JVM pauses ... 
In this case it was mostly safepoint +pauses caused by slow disks:: + + $ grep stopped gc.log.0.current | cut -f 11 -d ' ' | sort -n | tail | xargs -IX grep X gc.log.0.current| sort -k 1 + 2018-07-27T04:52:27.413+0000: 187831.482: Total time for which application threads were stopped: 6.5037022 seconds, Stopping threads took: 0.0005212 seconds + 2018-07-30T23:38:18.354+0000: 514582.423: Total time for which application threads were stopped: 6.3262938 seconds, Stopping threads took: 0.0004882 seconds + 2018-08-01T02:37:48.380+0000: 611752.450: Total time for which application threads were stopped: 10.3879659 seconds, Stopping threads took: 0.0004475 seconds + 2018-08-06T22:04:14.990+0000: 1113739.059: Total time for which application threads were stopped: 6.0917409 seconds, Stopping threads took: 0.0005553 seconds + 2018-08-14T00:04:06.091+0000: 1725730.160: Total time for which application threads were stopped: 6.0141054 seconds, Stopping threads took: 0.0004976 seconds + 2018-08-17T06:23:06.755+0000: 2007670.824: Total time for which application threads were stopped: 6.0133694 seconds, Stopping threads took: 0.0006011 seconds + 2018-08-23T06:35:46.068+0000: 2526830.137: Total time for which application threads were stopped: 6.4767751 seconds, Stopping threads took: 6.4426849 seconds + 2018-08-23T06:36:29.018+0000: 2526873.087: Total time for which application threads were stopped: 11.4949489 seconds, Stopping threads took: 11.4638297 seconds + 2018-08-23T06:37:12.671+0000: 2526916.741: Total time for which application threads were stopped: 6.3867003 seconds, Stopping threads took: 6.3507166 seconds + 2018-08-23T06:37:47.156+0000: 2526951.225: Total time for which application threads were stopped: 7.9528200 seconds, Stopping threads took: 7.9197756 seconds + +Sometimes reading and understanding java GC logs is hard, but you can take the +raw GC files and visualize them using tools such as `GCViewer +<https://github.com/chewiebug/GCViewer>`_ which take the Cassandra GC log as +input and show you detailed visual information on your garbage collection +performance. This includes pause analysis as well as throughput information. +For a stable Cassandra JVM you probably want to aim for pauses less than +`200ms` and GC throughput greater than `99%` (ymmv). + +Java GC pauses are one of the leading causes of tail latency in Cassandra +(along with drive latency) so sometimes this information can be crucial +while debugging tail latency issues. + + +Getting More Information +------------------------ + +If the default logging levels are insuficient, ``nodetool`` can set higher +or lower logging levels for various packages and classes using the +``nodetool setlogginglevel`` command. Start by viewing the current levels:: + + $ nodetool getlogginglevels + + Logger Name Log Level + ROOT INFO + org.apache.cassandra DEBUG + +Perhaps the ``Gossiper`` is acting up and we wish to enable it at ``TRACE`` +level for even more insight:: + + + $ nodetool setlogginglevel org.apache.cassandra.gms.Gossiper TRACE + + $ nodetool getlogginglevels + + Logger Name Log Level + ROOT INFO + org.apache.cassandra DEBUG + org.apache.cassandra.gms.Gossiper TRACE + + $ grep TRACE debug.log | tail -2 + TRACE [GossipStage:1] 2018-07-04 17:07:47,879 Gossiper.java:1234 - Updating + heartbeat state version to 2344 from 2343 for 127.0.0.2:7000 ... 
+ TRACE [GossipStage:1] 2018-07-04 17:07:47,879 Gossiper.java:923 - local + heartbeat version 2341 greater than 2340 for 127.0.0.1:7000 + + +Note that any changes made this way are reverted on next Cassandra process +restart. To make the changes permanent add the appropriate rule to +``logback.xml``. + +.. code-block:: diff + + diff --git a/conf/logback.xml b/conf/logback.xml + index b2c5b10..71b0a49 100644 + --- a/conf/logback.xml + +++ b/conf/logback.xml + @@ -98,4 +98,5 @@ appender reference in the root level section below. + </root> + + <logger name="org.apache.cassandra" level="DEBUG"/> + + <logger name="org.apache.cassandra.gms.Gossiper" level="TRACE"/> + </configuration> + +Full Query Logger +^^^^^^^^^^^^^^^^^ + +Cassandra 4.0 additionally ships with support for full query logging. This +is a highly performant binary logging tool which captures Cassandra queries +in real time, writes them (if possible) to a log file, and ensures the total +size of the capture does not exceed a particular limit. FQL is enabled with +``nodetool`` and the logs are read with the provided ``bin/fqltool`` utility:: + + $ mkdir /var/tmp/fql_logs + $ nodetool enablefullquerylog --path /var/tmp/fql_logs + + # ... do some querying + + $ bin/fqltool dump /var/tmp/fql_logs/20180705-00.cq4 | tail + Query time: 1530750927224 + Query: SELECT * FROM system_virtual_schema.columns WHERE keyspace_name = + 'system_views' AND table_name = 'sstable_tasks'; + Values: + + Type: single + Protocol version: 4 + Query time: 1530750934072 + Query: select * from keyspace1.standard1 ; + Values: + + $ nodetool disablefullquerylog + +Note that if you want more information than this tool provides, there are other +live capture options available such as :ref:`packet capture <packet-capture>`. http://git-wip-us.apache.org/repos/asf/cassandra/blob/57b87d21/doc/source/troubleshooting/use_nodetool.rst ---------------------------------------------------------------------- diff --git a/doc/source/troubleshooting/use_nodetool.rst b/doc/source/troubleshooting/use_nodetool.rst new file mode 100644 index 0000000..5072f85 --- /dev/null +++ b/doc/source/troubleshooting/use_nodetool.rst @@ -0,0 +1,245 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +.. _use-nodetool: + +Use Nodetool +============ + +Cassandra's ``nodetool`` allows you to narrow problems from the cluster down +to a particular node and gives a lot of insight into the state of the Cassandra +process itself. There are dozens of useful commands (see ``nodetool help`` +for all the commands), but briefly some of the most useful for troubleshooting: + +.. 
_nodetool-status: + +Cluster Status +-------------- + +You can use ``nodetool status`` to assess status of the cluster:: + + $ nodetool status <optional keyspace> + + Datacenter: dc1 + ======================= + Status=Up/Down + |/ State=Normal/Leaving/Joining/Moving + -- Address Load Tokens Owns (effective) Host ID Rack + UN 127.0.1.1 4.69 GiB 1 100.0% 35ea8c9f-b7a2-40a7-b9c5-0ee8b91fdd0e r1 + UN 127.0.1.2 4.71 GiB 1 100.0% 752e278f-b7c5-4f58-974b-9328455af73f r2 + UN 127.0.1.3 4.69 GiB 1 100.0% 9dc1a293-2cc0-40fa-a6fd-9e6054da04a7 r3 + +In this case we can see that we have three nodes in one datacenter with about +4.6GB of data each and they are all "up". The up/down status of a node is +independently determined by every node in the cluster, so you may have to run +``nodetool status`` on multiple nodes in a cluster to see the full view. + +You can use ``nodetool status`` plus a little grep to see which nodes are +down:: + + $ nodetool status | grep -v '^UN' + Datacenter: dc1 + =============== + Status=Up/Down + |/ State=Normal/Leaving/Joining/Moving + -- Address Load Tokens Owns (effective) Host ID Rack + Datacenter: dc2 + =============== + Status=Up/Down + |/ State=Normal/Leaving/Joining/Moving + -- Address Load Tokens Owns (effective) Host ID Rack + DN 127.0.0.5 105.73 KiB 1 33.3% df303ac7-61de-46e9-ac79-6e630115fd75 r1 + +In this case there are two datacenters and there is one node down in datacenter +``dc2`` and rack ``r1``. This may indicate an issue on ``127.0.0.5`` +warranting investigation. + +.. _nodetool-proxyhistograms: + +Coordinator Query Latency +------------------------- +You can view latency distributions of coordinator read and write latency +to help narrow down latency issues using ``nodetool proxyhistograms``:: + + $ nodetool proxyhistograms + Percentile Read Latency Write Latency Range Latency CAS Read Latency CAS Write Latency View Write Latency + (micros) (micros) (micros) (micros) (micros) (micros) + 50% 454.83 219.34 0.00 0.00 0.00 0.00 + 75% 545.79 263.21 0.00 0.00 0.00 0.00 + 95% 654.95 315.85 0.00 0.00 0.00 0.00 + 98% 785.94 379.02 0.00 0.00 0.00 0.00 + 99% 3379.39 2346.80 0.00 0.00 0.00 0.00 + Min 42.51 105.78 0.00 0.00 0.00 0.00 + Max 25109.16 43388.63 0.00 0.00 0.00 0.00 + +Here you can see the full latency distribution of reads, writes, range requests +(e.g. ``select * from keyspace.table``), CAS read (compare phase of CAS) and +CAS write (set phase of compare and set). These can be useful for narrowing +down high level latency problems, for example in this case if a client had a +20 millisecond timeout on their reads they might experience the occasional +timeout from this node but less than 1% (since the 99% read latency is 3.3 +milliseconds < 20 milliseconds). + +.. _nodetool-tablehistograms: + +Local Query Latency +------------------- + +If you know which table is having latency/error issues, you can use +``nodetool tablehistograms`` to get a better idea of what is happening +locally on a node:: + + $ nodetool tablehistograms keyspace table + Percentile SSTables Write Latency Read Latency Partition Size Cell Count + (micros) (micros) (bytes) + 50% 0.00 73.46 182.79 17084 103 + 75% 1.00 88.15 315.85 17084 103 + 95% 2.00 126.93 545.79 17084 103 + 98% 2.00 152.32 654.95 17084 103 + 99% 2.00 182.79 785.94 17084 103 + Min 0.00 42.51 24.60 14238 87 + Max 2.00 12108.97 17436.92 17084 103 + +This shows you percentile breakdowns particularly critical metrics. + +The first column contains how many sstables were read per logical read. 
A very +high number here indicates that you may have chosen the wrong compaction +strategy, e.g. ``SizeTieredCompactionStrategy`` typically has many more reads +per read than ``LeveledCompactionStrategy`` does for update heavy workloads. + +The second column shows you a latency breakdown of *local* write latency. In +this case we see that while the p50 is quite good at 73 microseconds, the +maximum latency is quite slow at 12 milliseconds. High write max latencies +often indicate a slow commitlog volume (slow to fsync) or large writes +that quickly saturate commitlog segments. + +The third column shows you a latency breakdown of *local* read latency. We can +see that local Cassandra reads are (as expected) slower than local writes, and +the read speed correlates highly with the number of sstables read per read. + +The fourth and fifth columns show distributions of partition size and column +count per partition. These are useful for determining if the table has on +average skinny or wide partitions and can help you isolate bad data patterns. +For example if you have a single cell that is 2 megabytes, that is probably +going to cause some heap pressure when it's read. + +.. _nodetool-tpstats: + +Threadpool State +---------------- + +You can use ``nodetool tpstats`` to view the current outstanding requests on +a particular node. This is useful for trying to find out which resource +(read threads, write threads, compaction, request response threads) the +Cassandra process lacks. For example:: + + $ nodetool tpstats + Pool Name Active Pending Completed Blocked All time blocked + ReadStage 2 0 12 0 0 + MiscStage 0 0 0 0 0 + CompactionExecutor 0 0 1940 0 0 + MutationStage 0 0 0 0 0 + GossipStage 0 0 10293 0 0 + Repair-Task 0 0 0 0 0 + RequestResponseStage 0 0 16 0 0 + ReadRepairStage 0 0 0 0 0 + CounterMutationStage 0 0 0 0 0 + MemtablePostFlush 0 0 83 0 0 + ValidationExecutor 0 0 0 0 0 + MemtableFlushWriter 0 0 30 0 0 + ViewMutationStage 0 0 0 0 0 + CacheCleanupExecutor 0 0 0 0 0 + MemtableReclaimMemory 0 0 30 0 0 + PendingRangeCalculator 0 0 11 0 0 + SecondaryIndexManagement 0 0 0 0 0 + HintsDispatcher 0 0 0 0 0 + Native-Transport-Requests 0 0 192 0 0 + MigrationStage 0 0 14 0 0 + PerDiskMemtableFlushWriter_0 0 0 30 0 0 + Sampler 0 0 0 0 0 + ViewBuildExecutor 0 0 0 0 0 + InternalResponseStage 0 0 0 0 0 + AntiEntropyStage 0 0 0 0 0 + + Message type Dropped Latency waiting in queue (micros) + 50% 95% 99% Max + READ 0 N/A N/A N/A N/A + RANGE_SLICE 0 0.00 0.00 0.00 0.00 + _TRACE 0 N/A N/A N/A N/A + HINT 0 N/A N/A N/A N/A + MUTATION 0 N/A N/A N/A N/A + COUNTER_MUTATION 0 N/A N/A N/A N/A + BATCH_STORE 0 N/A N/A N/A N/A + BATCH_REMOVE 0 N/A N/A N/A N/A + REQUEST_RESPONSE 0 0.00 0.00 0.00 0.00 + PAGED_RANGE 0 N/A N/A N/A N/A + READ_REPAIR 0 N/A N/A N/A N/A + +This command shows you all kinds of interesting statistics. The first section +shows a detailed breakdown of threadpools for each Cassandra stage, including +how many threads are current executing (Active) and how many are waiting to +run (Pending). Typically if you see pending executions in a particular +threadpool that indicates a problem localized to that type of operation. 
For +example if the ``RequestResponseStage`` queue is backing up, that means +that the coordinators are waiting on a lot of downstream replica requests, which +may indicate a lack of token awareness, or very high consistency levels being +used on read requests (for example reading at ``ALL`` ties up RF +``RequestResponseStage`` threads whereas ``LOCAL_ONE`` only uses a single +thread in the ``ReadStage`` threadpool). On the other hand if you see a lot of +pending compactions that may indicate that your compaction threads cannot keep +up with the volume of writes and you may need to tune either the compaction +strategy or the ``concurrent_compactors`` or ``compaction_throughput`` options. + +The second section shows drops (errors) and latency distributions for all the +major request types. Drops are cumulative since process start, and any drops +usually indicate a serious problem, as the default timeouts to qualify as +a drop are quite high (~5-10 seconds). Dropped messages often warrant further +investigation. + +.. _nodetool-compactionstats: + +Compaction State +---------------- + +As Cassandra is an LSM datastore, Cassandra sometimes has to compact sstables +together, which can have adverse effects on performance. In particular, +compaction uses a reasonable quantity of CPU resources, invalidates large +quantities of the OS `page cache <https://en.wikipedia.org/wiki/Page_cache>`_, +and can put a lot of load on your disk drives. There are great +:ref:`os tools <os-iostat>` to determine if this is the case, but often it's a +good idea to check if compactions are even running using +``nodetool compactionstats``:: + + $ nodetool compactionstats + pending tasks: 2 + - keyspace.table: 2 + + id compaction type keyspace table completed total unit progress + 2062b290-7f3a-11e8-9358-cd941b956e60 Compaction keyspace table 21848273 97867583 bytes 22.32% + Active compaction remaining time : 0h00m04s + +In this case there is a single compaction running on the ``keyspace.table`` +table, which has completed 21.8 megabytes of 97.9 megabytes, and Cassandra +estimates (based on the configured compaction throughput) that this will take +4 seconds. You can also pass ``-H`` to get the units in a human readable format. + +Generally each running compaction can consume a single core, but the more +you do in parallel the faster data compacts. Compaction is crucial to ensuring +good read performance, so having the right balance of concurrent compactions +such that compactions complete quickly but don't take too many resources +away from query threads is very important for performance. If you notice +compactions unable to keep up, try tuning Cassandra's ``concurrent_compactors`` +or ``compaction_throughput`` options. http://git-wip-us.apache.org/repos/asf/cassandra/blob/57b87d21/doc/source/troubleshooting/use_tools.rst ---------------------------------------------------------------------- diff --git a/doc/source/troubleshooting/use_tools.rst b/doc/source/troubleshooting/use_tools.rst new file mode 100644 index 0000000..b1347cc --- /dev/null +++ b/doc/source/troubleshooting/use_tools.rst @@ -0,0 +1,542 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at +.. +..
http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +.. _use-os-tools: + +Diving Deep, Use External Tools +=============================== + +Machine access allows operators to dive even deeper than logs and ``nodetool`` +allow. While every Cassandra operator may have their personal favorite +toolsets for troubleshooting issues, this page contains some of the most common +operator techniques and examples of those tools. Many of these commands work +only on Linux, but if you are deploying on a different operating system you may +have access to other substantially similar tools that assess similar OS level +metrics and processes. + +JVM Tooling +----------- +The JVM ships with a number of useful tools. Some of them are useful for +debugging Cassandra issues, especially related to heap and execution stacks. + +**NOTE**: There are two common gotchas with JVM tooling and Cassandra: + +1. By default Cassandra ships with ``-XX:+PerfDisableSharedMem`` set to prevent + long pauses (see ``CASSANDRA-9242`` and ``CASSANDRA-9483`` for details). If + you want to use JVM tooling you can instead have ``/tmp`` mounted on an in + memory ``tmpfs`` which also effectively works around ``CASSANDRA-9242``. +2. Make sure you run the tools as the same user as Cassandra is running as, + e.g. if the database is running as ``cassandra`` the tool also has to be + run as ``cassandra``, e.g. via ``sudo -u cassandra <cmd>``. + +Garbage Collection State (jstat) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +If you suspect heap pressure you can use ``jstat`` to dive deep into the +garbage collection state of a Cassandra process. This command is always +safe to run and yields detailed heap information including eden heap usage (E), +old generation heap usage (O), count of eden collections (YGC), time spend in +eden collections (YGCT), old/mixed generation collections (FGC) and time spent +in old/mixed generation collections (FGCT):: + + + jstat -gcutil <cassandra pid> 500ms + S0 S1 E O M CCS YGC YGCT FGC FGCT GCT + 0.00 0.00 81.53 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + 0.00 0.00 82.36 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + 0.00 0.00 82.36 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + 0.00 0.00 83.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + 0.00 0.00 83.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + 0.00 0.00 84.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + 0.00 0.00 84.19 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + 0.00 0.00 85.03 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + 0.00 0.00 85.03 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + 0.00 0.00 85.94 31.16 93.07 88.20 12 0.151 3 0.257 0.408 + +In this case we see we have a relatively healthy heap profile, with 31.16% +old generation heap usage and 83% eden. If the old generation routinely is +above 75% then you probably need more heap (assuming CMS with a 75% occupancy +threshold). If you do have such persistently high old gen that often means you +either have under-provisioned the old generation heap, or that there is too +much live data on heap for Cassandra to collect (e.g. because of memtables). +Another thing to watch for is time between young garbage collections (YGC), +which indicate how frequently the eden heap is collected. 
Each young gc pause +is about 20-50ms, so if you have a lot of them your clients will notice in +their high percentile latencies. + +Thread Information (jstack) +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +To get a point in time snapshot of exactly what Cassandra is doing, run +``jstack`` against the Cassandra PID. **Note** that this does pause the JVM for +a very brief period (<20ms).:: + + $ jstack <cassandra pid> > threaddump + + # display the threaddump + $ cat threaddump + ... + + # look at runnable threads + $grep RUNNABLE threaddump -B 1 + "Attach Listener" #15 daemon prio=9 os_prio=0 tid=0x00007f829c001000 nid=0x3a74 waiting on condition [0x0000000000000000] + java.lang.Thread.State: RUNNABLE + -- + "DestroyJavaVM" #13 prio=5 os_prio=0 tid=0x00007f82e800e000 nid=0x2a19 waiting on condition [0x0000000000000000] + java.lang.Thread.State: RUNNABLE + -- + "JPS thread pool" #10 prio=5 os_prio=0 tid=0x00007f82e84d0800 nid=0x2a2c runnable [0x00007f82d0856000] + java.lang.Thread.State: RUNNABLE + -- + "Service Thread" #9 daemon prio=9 os_prio=0 tid=0x00007f82e80d7000 nid=0x2a2a runnable [0x0000000000000000] + java.lang.Thread.State: RUNNABLE + -- + "C1 CompilerThread3" #8 daemon prio=9 os_prio=0 tid=0x00007f82e80cc000 nid=0x2a29 waiting on condition [0x0000000000000000] + java.lang.Thread.State: RUNNABLE + -- + ... + + # Note that the nid is the Linux thread id + +Some of the most important information in the threaddumps are waiting/blocking +threads, including what locks or monitors the thread is blocking/waiting on. + +Basic OS Tooling +---------------- +A great place to start when debugging a Cassandra issue is understanding how +Cassandra is interacting with system resources. The following are all +resources that Cassandra makes heavy uses of: + +* CPU cores. For executing concurrent user queries +* CPU processing time. For query activity (data decompression, row merging, + etc...) +* CPU processing time (low priority). For background tasks (compaction, + streaming, etc ...) +* RAM for Java Heap. Used to hold internal data-structures and by default the + Cassandra memtables. Heap space is a crucial component of write performance + as well as generally. +* RAM for OS disk cache. Used to cache frequently accessed SSTable blocks. OS + disk cache is a crucial component of read performance. +* Disks. Cassandra cares a lot about disk read latency, disk write throughput, + and of course disk space. +* Network latency. Cassandra makes many internode requests, so network latency + between nodes can directly impact performance. +* Network throughput. Cassandra (as other databases) frequently have the + so called "incast" problem where a small request (e.g. ``SELECT * from + foo.bar``) returns a massively large result set (e.g. the entire dataset). + In such situations outgoing bandwidth is crucial. + +Often troubleshooting Cassandra comes down to troubleshooting what resource +the machine or cluster is running out of. Then you create more of that resource +or change the query pattern to make less use of that resource. + +High Level Resource Usage (top/htop) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Cassandra makes signifiant use of system resources, and often the very first +useful action is to run ``top`` or ``htop`` (`website +<https://hisham.hm/htop/>`_)to see the state of the machine. + +Useful things to look at: + +* System load levels. 
While these numbers can be confusing, generally speaking + if the load average is greater than the number of CPU cores, Cassandra + probably won't have very good (sub 100 millisecond) latencies. See + `Linux Load Averages <http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html>`_ + for more information. +* CPU utilization. ``htop`` in particular can help break down CPU utilization + into ``user`` (low and normal priority), ``system`` (kernel), and ``io-wait`` + . Cassandra query threads execute as normal priority ``user`` threads, while + compaction threads execute as low priority ``user`` threads. High ``system`` + time could indicate problems like thread contention, and high ``io-wait`` + may indicate slow disk drives. This can help you understand what Cassandra + is spending processing resources doing. +* Memory usage. Look for which programs have the most resident memory, it is + probably Cassandra. The number for Cassandra is likely inaccurately high due + to how Linux (as of 2018) accounts for memory mapped file memory. + +.. _os-iostat: + +IO Usage (iostat) +^^^^^^^^^^^^^^^^^ +Use iostat to determine how data drives are faring, including latency +distributions, throughput, and utilization:: + + $ sudo iostat -xdm 2 + Linux 4.13.0-13-generic (hostname) 07/03/2018 _x86_64_ (8 CPU) + + Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util + sda 0.00 0.28 0.32 5.42 0.01 0.13 48.55 0.01 2.21 0.26 2.32 0.64 0.37 + sdb 0.00 0.00 0.00 0.00 0.00 0.00 79.34 0.00 0.20 0.20 0.00 0.16 0.00 + sdc 0.34 0.27 0.76 0.36 0.01 0.02 47.56 0.03 26.90 2.98 77.73 9.21 1.03 + + Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util + sda 0.00 0.00 2.00 32.00 0.01 4.04 244.24 0.54 16.00 0.00 17.00 1.06 3.60 + sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 + sdc 0.00 24.50 0.00 114.00 0.00 11.62 208.70 5.56 48.79 0.00 48.79 1.12 12.80 + + +In this case we can see that ``/dev/sdc1`` is a very slow drive, having an +``await`` close to 50 milliseconds and an ``avgqu-sz`` close to 5 ios. The +drive is not particularly saturated (utilization is only 12.8%), but we should +still be concerned about how this would affect our p99 latency since 50ms is +quite long for typical Cassandra operations. That being said, in this case +most of the latency is present in writes (typically writes are more latent +than reads), which due to the LSM nature of Cassandra is often hidden from +the user. + +Important metrics to assess using iostat: + +* Reads and writes per second. These numbers will change with the workload, + but generally speaking the more reads Cassandra has to do from disk the + slower Cassandra read latencies are. Large numbers of reads per second + can be a dead giveaway that the cluster has insufficient memory for OS + page caching. +* Write throughput. Cassandra's LSM model defers user writes and batches them + together, which means that throughput to the underlying medium is the most + important write metric for Cassandra. +* Read latency (``r_await``). When Cassandra missed the OS page cache and reads + from SSTables, the read latency directly determines how fast Cassandra can + respond with the data. +* Write latency. Cassandra is less sensitive to write latency except when it + syncs the commit log. This typically enters into the very high percentiles of + write latency. + +Note that to get detailed latency breakdowns you will need a more advanced +tool such as :ref:`bcc-tools <use-bcc-tools>`. 
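It also helps to confirm which block device actually backs the Cassandra data
and commitlog directories, so that you watch the right ``iostat`` rows. A quick
sketch, assuming the default ``/var/lib/cassandra`` locations (check
``data_file_directories`` and ``commitlog_directory`` in ``cassandra.yaml`` if
you have moved them)::

    $ # Which filesystem (and therefore device) holds the sstables and commitlog?
    $ df -h /var/lib/cassandra/data /var/lib/cassandra/commitlog

    $ # Map that filesystem back to the underlying disk(s)
    $ lsblk

Putting the commitlog on its own device is a common way to keep commitlog
fsync latency away from read-heavy data drives.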
+ +OS page Cache Usage +^^^^^^^^^^^^^^^^^^^ +As Cassandra makes heavy use of memory mapped files, the health of the +operating system's `Page Cache <https://en.wikipedia.org/wiki/Page_cache>`_ is +crucial to performance. Start by finding how much available cache is in the +system:: + + $ free -g + total used free shared buff/cache available + Mem: 15 9 2 0 3 5 + Swap: 0 0 0 + +In this case 9GB of memory is used by user processes (Cassandra heap) and 8GB +is available for OS page cache. Of that, 3GB is actually used to cache files. +If most memory is used and unavailable to the page cache, Cassandra performance +can suffer significantly. This is why Cassandra starts with a reasonably small +amount of memory reserved for the heap. + +If you suspect that you are missing the OS page cache frequently you can use +advanced tools like :ref:`cachestat <use-bcc-tools>` or +:ref:`vmtouch <use-vmtouch>` to dive deeper. + +Network Latency and Reliability +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Whenever Cassandra does writes or reads that involve other replicas, +``LOCAL_QUORUM`` reads for example, one of the dominant effects on latency is +network latency. When trying to debug issues with multi machine operations, +the network can be an important resource to investigate. You can determine +internode latency using tools like ``ping`` and ``traceroute`` or most +effectively ``mtr``:: + + $ mtr -nr www.google.com + Start: Sun Jul 22 13:10:28 2018 + HOST: hostname Loss% Snt Last Avg Best Wrst StDev + 1.|-- 192.168.1.1 0.0% 10 2.0 1.9 1.1 3.7 0.7 + 2.|-- 96.123.29.15 0.0% 10 11.4 11.0 9.0 16.4 1.9 + 3.|-- 68.86.249.21 0.0% 10 10.6 10.7 9.0 13.7 1.1 + 4.|-- 162.141.78.129 0.0% 10 11.5 10.6 9.6 12.4 0.7 + 5.|-- 162.151.78.253 0.0% 10 10.9 12.1 10.4 20.2 2.8 + 6.|-- 68.86.143.93 0.0% 10 12.4 12.6 9.9 23.1 3.8 + 7.|-- 96.112.146.18 0.0% 10 11.9 12.4 10.6 15.5 1.6 + 9.|-- 209.85.252.250 0.0% 10 13.7 13.2 12.5 13.9 0.0 + 10.|-- 108.170.242.238 0.0% 10 12.7 12.4 11.1 13.0 0.5 + 11.|-- 74.125.253.149 0.0% 10 13.4 13.7 11.8 19.2 2.1 + 12.|-- 216.239.62.40 0.0% 10 13.4 14.7 11.5 26.9 4.6 + 13.|-- 108.170.242.81 0.0% 10 14.4 13.2 10.9 16.0 1.7 + 14.|-- 72.14.239.43 0.0% 10 12.2 16.1 11.0 32.8 7.1 + 15.|-- 216.58.195.68 0.0% 10 25.1 15.3 11.1 25.1 4.8 + +In this example of ``mtr``, we can rapidly assess the path that your packets +are taking, as well as what their typical loss and latency are. Packet loss +typically leads to between ``200ms`` and ``3s`` of additional latency, so that +can be a common cause of latency issues. + +Network Throughput +^^^^^^^^^^^^^^^^^^ +As Cassandra is sensitive to outgoing bandwidth limitations, sometimes it is +useful to determine if network throughput is limited. One handy tool to do +this is `iftop <https://www.systutorials.com/docs/linux/man/8-iftop/>`_ which +shows both bandwidth usage as well as connection information at a glance. 
An +example showing traffic during a stress run against a local ``ccm`` cluster:: + + $ # remove the -t for ncurses instead of pure text + $ sudo iftop -nNtP -i lo + interface: lo + IP address is: 127.0.0.1 + MAC address is: 00:00:00:00:00:00 + Listening on lo + # Host name (port/service if enabled) last 2s last 10s last 40s cumulative + -------------------------------------------------------------------------------------------- + 1 127.0.0.1:58946 => 869Kb 869Kb 869Kb 217KB + 127.0.0.3:9042 <= 0b 0b 0b 0B + 2 127.0.0.1:54654 => 736Kb 736Kb 736Kb 184KB + 127.0.0.1:9042 <= 0b 0b 0b 0B + 3 127.0.0.1:51186 => 669Kb 669Kb 669Kb 167KB + 127.0.0.2:9042 <= 0b 0b 0b 0B + 4 127.0.0.3:9042 => 3.30Kb 3.30Kb 3.30Kb 845B + 127.0.0.1:58946 <= 0b 0b 0b 0B + 5 127.0.0.1:9042 => 2.79Kb 2.79Kb 2.79Kb 715B + 127.0.0.1:54654 <= 0b 0b 0b 0B + 6 127.0.0.2:9042 => 2.54Kb 2.54Kb 2.54Kb 650B + 127.0.0.1:51186 <= 0b 0b 0b 0B + 7 127.0.0.1:36894 => 1.65Kb 1.65Kb 1.65Kb 423B + 127.0.0.5:7000 <= 0b 0b 0b 0B + 8 127.0.0.1:38034 => 1.50Kb 1.50Kb 1.50Kb 385B + 127.0.0.2:7000 <= 0b 0b 0b 0B + 9 127.0.0.1:56324 => 1.50Kb 1.50Kb 1.50Kb 383B + 127.0.0.1:7000 <= 0b 0b 0b 0B + 10 127.0.0.1:53044 => 1.43Kb 1.43Kb 1.43Kb 366B + 127.0.0.4:7000 <= 0b 0b 0b 0B + -------------------------------------------------------------------------------------------- + Total send rate: 2.25Mb 2.25Mb 2.25Mb + Total receive rate: 0b 0b 0b + Total send and receive rate: 2.25Mb 2.25Mb 2.25Mb + -------------------------------------------------------------------------------------------- + Peak rate (sent/received/total): 2.25Mb 0b 2.25Mb + Cumulative (sent/received/total): 576KB 0B 576KB + ============================================================================================ + +In this case we can see that bandwidth is fairly shared between many peers, +but if the total was getting close to the rated capacity of the NIC or was focussed +on a single client, that may indicate a clue as to what issue is occurring. + +Advanced tools +-------------- +Sometimes as an operator you may need to really dive deep. This is where +advanced OS tooling can come in handy. + +.. _use-bcc-tools: + +bcc-tools +^^^^^^^^^ +Most modern Linux distributions (kernels newer than ``4.1``) support `bcc-tools +<https://github.com/iovisor/bcc>`_ for diving deep into performance problems. +First install ``bcc-tools``, e.g. via ``apt`` on Debian:: + + $ apt install bcc-tools + +Then you can use all the tools that ``bcc-tools`` contains. One of the most +useful tools is ``cachestat`` +(`cachestat examples <https://github.com/iovisor/bcc/blob/master/tools/cachestat_example.txt>`_) +which allows you to determine exactly how many OS page cache hits and misses +are happening:: + + $ sudo /usr/share/bcc/tools/cachestat -T 1 + TIME TOTAL MISSES HITS DIRTIES BUFFERS_MB CACHED_MB + 18:44:08 66 66 0 64 88 4427 + 18:44:09 40 40 0 75 88 4427 + 18:44:10 4353 45 4308 203 88 4427 + 18:44:11 84 77 7 13 88 4428 + 18:44:12 2511 14 2497 14 88 4428 + 18:44:13 101 98 3 18 88 4428 + 18:44:14 16741 0 16741 58 88 4428 + 18:44:15 1935 36 1899 18 88 4428 + 18:44:16 89 34 55 18 88 4428 + +In this case there are not too many page cache ``MISSES`` which indicates a +reasonably sized cache. These metrics are the most direct measurement of your +Cassandra node's "hot" dataset. If you don't have enough cache, ``MISSES`` will +be high and performance will be slow. If you have enough cache, ``MISSES`` will +be low and performance will be fast (as almost all reads are being served out +of memory). 
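If you want a single number to watch over time, you can post-process the
``cachestat`` output into a hit rate. This is only a sketch and assumes the
column layout shown above (``TOTAL``, ``MISSES`` and ``HITS`` in columns two
through four), which may differ between ``bcc`` versions::

    $ # Print a per-second page cache hit rate next to each timestamp
    $ sudo /usr/share/bcc/tools/cachestat -T 1 | \
        awk 'NR > 1 && $2 > 0 { printf "%s hit-rate=%.1f%%\n", $1, 100 * $4 / $2 }'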
+ +You can also measure disk latency distributions using ``biolatency`` +(`biolatency examples <https://github.com/iovisor/bcc/blob/master/tools/biolatency_example.txt>`_) +to get an idea of how slow Cassandra will be when reads miss the OS page Cache +and have to hit disks:: + + $ sudo /usr/share/bcc/tools/biolatency -D 10 + Tracing block device I/O... Hit Ctrl-C to end. + + + disk = 'sda' + usecs : count distribution + 0 -> 1 : 0 | | + 2 -> 3 : 0 | | + 4 -> 7 : 0 | | + 8 -> 15 : 0 | | + 16 -> 31 : 12 |****************************************| + 32 -> 63 : 9 |****************************** | + 64 -> 127 : 1 |*** | + 128 -> 255 : 3 |********** | + 256 -> 511 : 7 |*********************** | + 512 -> 1023 : 2 |****** | + + disk = 'sdc' + usecs : count distribution + 0 -> 1 : 0 | | + 2 -> 3 : 0 | | + 4 -> 7 : 0 | | + 8 -> 15 : 0 | | + 16 -> 31 : 0 | | + 32 -> 63 : 0 | | + 64 -> 127 : 41 |************ | + 128 -> 255 : 17 |***** | + 256 -> 511 : 13 |*** | + 512 -> 1023 : 2 | | + 1024 -> 2047 : 0 | | + 2048 -> 4095 : 0 | | + 4096 -> 8191 : 56 |***************** | + 8192 -> 16383 : 131 |****************************************| + 16384 -> 32767 : 9 |** | + +In this case most ios on the data drive (``sdc``) are fast, but many take +between 8 and 16 milliseconds. + +Finally ``biosnoop`` (`examples <https://github.com/iovisor/bcc/blob/master/tools/biosnoop_example.txt>`_) +can be used to dive even deeper and see per IO latencies:: + + $ sudo /usr/share/bcc/tools/biosnoop | grep java | head + 0.000000000 java 17427 sdc R 3972458600 4096 13.58 + 0.000818000 java 17427 sdc R 3972459408 4096 0.35 + 0.007098000 java 17416 sdc R 3972401824 4096 5.81 + 0.007896000 java 17416 sdc R 3972489960 4096 0.34 + 0.008920000 java 17416 sdc R 3972489896 4096 0.34 + 0.009487000 java 17427 sdc R 3972401880 4096 0.32 + 0.010238000 java 17416 sdc R 3972488368 4096 0.37 + 0.010596000 java 17427 sdc R 3972488376 4096 0.34 + 0.011236000 java 17410 sdc R 3972488424 4096 0.32 + 0.011825000 java 17427 sdc R 3972488576 16384 0.65 + ... time passes + 8.032687000 java 18279 sdc R 10899712 122880 3.01 + 8.033175000 java 18279 sdc R 10899952 8192 0.46 + 8.073295000 java 18279 sdc R 23384320 122880 3.01 + 8.073768000 java 18279 sdc R 23384560 8192 0.46 + + +With ``biosnoop`` you see every single IO and how long they take. This data +can be used to construct the latency distributions in ``biolatency`` but can +also be used to better understand how disk latency affects performance. For +example this particular drive takes ~3ms to service a memory mapped read due to +the large default value (``128kb``) of ``read_ahead_kb``. To improve point read +performance you may may want to decrease ``read_ahead_kb`` on fast data volumes +such as SSDs while keeping the a higher value like ``128kb`` value is probably +right for HDs. There are tradeoffs involved, see `queue-sysfs +<https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt>`_ docs for more +information, but regardless ``biosnoop`` is useful for understanding *how* +Cassandra uses drives. + +.. _use-vmtouch: + +vmtouch +^^^^^^^ +Sometimes it's useful to know how much of the Cassandra data files are being +cached by the OS. A great tool for answering this question is +`vmtouch <https://github.com/hoytech/vmtouch>`_. 
+ +First install it:: + + $ git clone https://github.com/hoytech/vmtouch.git + $ cd vmtouch + $ make + +Then run it on the Cassandra data directory:: + + $ ./vmtouch /var/lib/cassandra/data/ + Files: 312 + Directories: 92 + Resident Pages: 62503/64308 244M/251M 97.2% + Elapsed: 0.005657 seconds + +In this case almost the entire dataset is hot in OS page Cache. Generally +speaking the percentage doesn't really matter unless reads are missing the +cache (per e.g. :ref:`cachestat <use-bcc-tools>`), in which case having +additional memory may help read performance. + +CPU Flamegraphs +^^^^^^^^^^^^^^^ +Cassandra often uses a lot of CPU, but telling *what* it is doing can prove +difficult. One of the best ways to analyze Cassandra on CPU time is to use +`CPU Flamegraphs <http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html>`_ +which display in a useful way which areas of Cassandra code are using CPU. This +may help narrow down a compaction problem to a "compaction problem dropping +tombstones" or just generally help you narrow down what Cassandra is doing +while it is having an issue. To get CPU flamegraphs follow the instructions for +`Java Flamegraphs +<http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java>`_. + +Generally: + +1. Enable the ``-XX:+PreserveFramePointer`` option in Cassandra's + ``jvm.options`` configuation file. This has a negligible performance impact + but allows you actually see what Cassandra is doing. +2. Run ``perf`` to get some data. +3. Send that data through the relevant scripts in the FlameGraph toolset and + convert the data into a pretty flamegraph. View the resulting SVG image in + a browser or other image browser. + +For example just cloning straight off github we first install the +``perf-map-agent`` to the location of our JVMs (assumed to be +``/usr/lib/jvm``):: + + $ sudo bash + $ export JAVA_HOME=/usr/lib/jvm/java-8-oracle/ + $ cd /usr/lib/jvm + $ git clone --depth=1 https://github.com/jvm-profiling-tools/perf-map-agent + $ cd perf-map-agent + $ cmake . + $ make + +Now to get a flamegraph:: + + $ git clone --depth=1 https://github.com/brendangregg/FlameGraph + $ sudo bash + $ cd FlameGraph + $ # Record traces of Cassandra and map symbols for all java processes + $ perf record -F 49 -a -g -p <CASSANDRA PID> -- sleep 30; ./jmaps + $ # Translate the data + $ perf script > cassandra_stacks + $ cat cassandra_stacks | ./stackcollapse-perf.pl | grep -v cpu_idle | \ + ./flamegraph.pl --color=java --hash > cassandra_flames.svg + + +The resulting SVG is searchable, zoomable, and generally easy to introspect +using a browser. + +.. _packet-capture: + +Packet Capture +^^^^^^^^^^^^^^ +Sometimes you have to understand what queries a Cassandra node is performing +*right now* to troubleshoot an issue. For these times trusty packet capture +tools like ``tcpdump`` and `Wireshark +<https://www.wireshark.org/>`_ can be very helpful to dissect packet captures. +Wireshark even has native `CQL support +<https://www.wireshark.org/docs/dfref/c/cql.html>`_ although it sometimes has +compatibility issues with newer Cassandra protocol releases. + +To get a packet capture first capture some packets:: + + $ sudo tcpdump -U -s0 -i <INTERFACE> -w cassandra.pcap -n "tcp port 9042" + +Now open it up with wireshark:: + + $ wireshark cassandra.pcap + +If you don't see CQL like statements try telling to decode as CQL by right +clicking on a packet going to 9042 -> ``Decode as`` -> select CQL from the +dropdown for port 9042. 
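The same decode can be done from a terminal with ``tshark`` (Wireshark's
command line front end), which is convenient on servers without a display. A
rough sketch, assuming the capture file from above and a Wireshark build whose
CQL dissector supports your protocol version::

    $ # Decode port 9042 traffic as CQL and page through the dissected queries
    $ tshark -r cassandra.pcap -d tcp.port==9042,cql -Y cql -O cql | less

    $ # List the CQL fields your Wireshark/tshark version understands
    $ tshark -G fields | grep -i cql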
+ +If you don't want to do this manually or use a GUI, you can also use something +like `cqltrace <https://github.com/jolynch/cqltrace>`_ to ease obtaining and +parsing CQL packet captures.
