This is an automated email from the ASF dual-hosted git repository.
rustyrazorblade pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git
The following commit(s) were added to refs/heads/trunk by this push:
new 4dcee71 Added production recommendations and improved compaction doc
organization.
4dcee71 is described below
commit 4dcee71b2de36b52feddf720564f6e58ad8fcc6d
Author: Jon Haddad <[email protected]>
AuthorDate: Mon Mar 9 16:46:03 2020 -0700
Added production recommendations and improved compaction doc organization.
Each compaction strategy gets its own page.
No longer recommending m1 instances in AWS.
Patch by Jon Haddad; Reviewed by Jordan West for CASSANDRA-15618.
---
CHANGES.txt | 1 +
doc/source/getting_started/index.rst | 1 +
doc/source/getting_started/production.rst | 156 +++++++++++++++++
.../{compaction.rst => compaction/index.rst} | 184 +++------------------
doc/source/operating/compaction/lcs.rst | 90 ++++++++++
doc/source/operating/compaction/stcs.rst | 58 +++++++
doc/source/operating/compaction/twcs.rst | 76 +++++++++
doc/source/operating/hardware.rst | 2 -
doc/source/operating/index.rst | 2 +-
9 files changed, 402 insertions(+), 168 deletions(-)
diff --git a/CHANGES.txt b/CHANGES.txt
index 4cf5421..3057ca6 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
4.0-alpha4
+ * Added production recommendations and improved compaction doc organization
* Document usage of EC2Snitch with intra-region VPC peering (CASSANDRA-15337)
* Fixed flakey test in SASIIndexTest by shutting down its ExecutorService
(CASSANDRA-15528)
* Fixed empty check in TrieMemIndex due to potential state inconsistency in
ConcurrentSkipListMap (CASSANDRA-15526)
diff --git a/doc/source/getting_started/index.rst
b/doc/source/getting_started/index.rst
index 4ca9c4d..a699aee 100644
--- a/doc/source/getting_started/index.rst
+++ b/doc/source/getting_started/index.rst
@@ -29,5 +29,6 @@ Cassandra.
configuring
querying
drivers
+ production
diff --git a/doc/source/getting_started/production.rst
b/doc/source/getting_started/production.rst
new file mode 100644
index 0000000..fe0c4a5
--- /dev/null
+++ b/doc/source/getting_started/production.rst
@@ -0,0 +1,156 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+..
+.. http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+Production Recommendations
+----------------------------
+
+The ``cassandra.yaml`` and ``jvm.options`` files have a number of notes and
recommendations for production usage. This page
+expands on some of the notes in these files with additional information.
+
+Tokens
+^^^^^^^
+
+Using more than 1 token (referred to as vnodes) allows for more flexible
expansion and more streaming peers when
+bootstrapping new nodes into the cluster. This can limit the negative impact
of streaming (I/O and CPU overhead)
+as well as allow for incremental cluster expansion.
+
+As a tradeoff, more tokens will lead to sharing data with more peers, which
can result in decreased availability. To learn more about this we
+recommend reading `this paper
<https://github.com/jolynch/python_performance_toolkit/raw/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf>`_.
+
+The number of tokens can be changed using the following setting:
+
+``num_tokens: 16``
+
+
+Here are the most common token counts with a brief explanation of when and why
you would use each one.
+
++-------------+---------------------------------------------------------------------------------------------------+
+| Token Count | Description
|
++=============+===================================================================================================+
+| 1 | Maximum availablility, maximum cluster size, fewest peers,
|
+| | but inflexible expansion. Must always
|
+| | double size of cluster to expand and remain balanced.
|
++-------------+---------------------------------------------------------------------------------------------------+
+| 4 | A healthy mix of elasticity and availability. Recommended for
clusters which will eventually |
+| | reach over 30 nodes. Requires adding approximately 20% more
nodes to remain balanced. |
+| | Shrinking a cluster may result in cluster imbalance.
|
++-------------+---------------------------------------------------------------------------------------------------+
+| 16 | Best for heavily elastic clusters which expand and shrink
regularly, but may have issues |
+| | availability with larger clusters. Not recommended for
clusters over 50 nodes. |
++-------------+---------------------------------------------------------------------------------------------------+
+
+
+In addition to setting the token count, it's extremely important that
``allocate_tokens_for_local_replication_factor`` be
+set as well, to ensure even token allocation.
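+
+As a minimal ``cassandra.yaml`` sketch (assuming a keyspace replication factor
+of 3; adjust the value to match your own replication factor)::
+
+    num_tokens: 16
+    # allocate tokens evenly for keyspaces using a replication factor of 3
+    allocate_tokens_for_local_replication_factor: 3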
+
+.. _read-ahead:
+
+Read Ahead
+^^^^^^^^^^^
+
+Read ahead is an operating system feature that attempts to keep as much data
loaded in the page cache as possible. The
+goal is to decrease latency by using additional throughput on reads where the
latency penalty is high due to seek times
+on spinning disks. By leveraging read ahead, the OS can pull additional data
into memory without the cost of additional
+seeks. This works well when available RAM is greater than the size of the hot
dataset, but can be problematic when the
+hot dataset is much larger than available RAM. The benefit of read ahead
decreases as the size of your hot dataset gets
+bigger in proportion to available memory.
+
+With small partitions (usually tables with no clustering columns, but not
+limited to this case) and solid state drives, read
+ahead can increase disk usage without any of the latency benefits, and in some
cases can result in up to
+a 5x latency and throughput performance penalty. Read heavy, key/value tables
with small (under 1KB) rows are especially
+prone to this problem.
+
+We recommend the following read ahead settings:
+
++----------------+-------------------------+
+| Hardware | Initial Recommendation |
++================+=========================+
+|Spinning Disks | 64KB |
++----------------+-------------------------+
+|SSD | 4KB |
++----------------+-------------------------+
+
+Read ahead can be adjusted on Linux systems by using the ``blockdev`` tool.
+
+For example, we can set the read ahead of ``/dev/sda1`` to 4KB by doing the
following::
+
+ blockdev --setra 8 /dev/sda1
+
+**Note**: blockdev accepts the number of 512 byte sectors to read ahead. The
argument of 8 above is equivalent to 4KB.
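+
+To verify the current setting before changing it (``/dev/sda1`` is only an
+example device)::
+
+    # prints the current read ahead value in 512 byte sectors
+    blockdev --getra /dev/sda1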
+
+Since each system is different, use the above recommendations as a starting
point and tune based on your SLA and
+throughput requirements. To understand how read ahead impacts disk resource
usage we recommend carefully reading through the
+:ref:`troubleshooting <use-os-tools>` portion of the documentation.
+
+
+Compression
+^^^^^^^^^^^^
+
+Compressed data is stored by compressing fixed size byte buffers and writing
the data to disk. The buffer size is
+determined by the ``chunk_length_in_kb`` element in the compression map of
the schema settings.
+
+The default setting is 16KB starting with Cassandra 4.0.
+
+Since the entire compressed buffer must be read off disk, using too high of a
compression chunk length can lead to
+significant overhead when reading small records. Combined with the default
read ahead setting this can result in massive
+read amplification for certain workloads.
+
+LZ4Compressor is the default and recommended compression algorithm.
+
+There is additional information on this topic on `The Last Pickle Blog
<https://thelastpickle.com/blog/2018/08/08/compression_performance.html>`_.
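+
+As a sketch (the keyspace and table names are hypothetical), the chunk length
+can be set per table through the compression map::
+
+    ALTER TABLE mykeyspace.mytable
+    WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16};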
+
+Compaction
+^^^^^^^^^^^^
+
+There are different :ref:`compaction <compaction>` strategies available for
different workloads.
+We recommend reading up on the different strategies to understand which is the
best for your environment. Different tables
+may (and frequently do) use different compaction strategies on the same
cluster.
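+
+For example, a single table can be switched to a different strategy with a
+schema change (illustrative table name, not a recommendation)::
+
+    ALTER TABLE mykeyspace.mytable
+    WITH compaction = {'class': 'LeveledCompactionStrategy'};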
+
+Encryption
+^^^^^^^^^^^
+
+It is significantly easier to set up peer-to-peer encryption and client-server
+encryption when first setting up your production
+cluster than to add it once the cluster is already serving
production traffic. If you are planning on using network encryption
+eventually (in any form), we recommend setting it up now. Changing these
configurations down the line is not impossible,
+but mistakes can result in downtime or data loss.
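+
+A minimal sketch of the relevant ``cassandra.yaml`` sections (the keystore and
+truststore paths and passwords are placeholders)::
+
+    server_encryption_options:
+        internode_encryption: all
+        keystore: /path/to/keystore.jks
+        keystore_password: keystore_password
+        truststore: /path/to/truststore.jks
+        truststore_password: truststore_password
+
+    client_encryption_options:
+        enabled: true
+        keystore: /path/to/keystore.jks
+        keystore_password: keystore_password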
+
+Ensure Keyspaces are Created with NetworkTopologyStrategy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Production clusters should never use ``SimpleStrategy``. Production keyspaces
should use ``NetworkTopologyStrategy`` (NTS).
+
+For example::
+
+ CREATE KEYSPACE mykeyspace WITH replication =
+ {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};
+
+NetworkTopologyStrategy allows Cassandra to take advantage of multiple racks
and data centers.
+
+Configure Racks and Snitch
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+**Correctly configuring or changing racks after a cluster has been provisioned
is an unsupported process**. Migrating from
+a single rack to multiple racks is also unsupported and can result in data
loss.
+
+Using ``GossipingPropertyFileSnitch`` is the most flexible solution for
on-premises or mixed cloud environments. ``Ec2Snitch``
+is reliable for AWS EC2-only environments.
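+
+With ``GossipingPropertyFileSnitch``, each node declares its datacenter and rack
+in ``cassandra-rackdc.properties``; the values below are placeholders::
+
+    dc=datacenter1
+    rack=rack1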
+
+
+
+
+
+
+
diff --git a/doc/source/operating/compaction.rst
b/doc/source/operating/compaction/index.rst
similarity index 59%
rename from doc/source/operating/compaction.rst
rename to doc/source/operating/compaction/index.rst
index ace9aa9..ea505dd 100644
--- a/doc/source/operating/compaction.rst
+++ b/doc/source/operating/compaction/index.rst
@@ -21,6 +21,25 @@
Compaction
----------
+Strategies
+^^^^^^^^^^
+
+Picking the right compaction strategy for your workload will ensure the best
performance for both querying and compaction itself.
+
+:ref:`Size Tiered Compaction Strategy <stcs>`
+ The default compaction strategy. Useful as a fallback when other
strategies don't fit the workload. Most useful for
+ workloads that are not pure time series, with spinning disks, or when the
I/O from :ref:`LCS <lcs>` is too high.
+
+
+:ref:`Leveled Compaction Strategy <lcs>`
+ Leveled Compaction Strategy (LCS) is optimized for read heavy workloads,
or workloads with lots of updates and deletes. It is not a good choice for
immutable time series data.
+
+
+:ref:`Time Window Compaction Strategy <twcs>`
+ Time Window Compaction Strategy is designed for TTL'ed, mostly immutable
time series data.
+
+
+
Types of compaction
^^^^^^^^^^^^^^^^^^^
@@ -276,172 +295,7 @@ More detailed compaction logging
Enable with the compaction option ``log_all`` and a more detailed compaction
log file will be produced in your log
directory.
-.. _STCS:
-Size Tiered Compaction Strategy
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The basic idea of ``SizeTieredCompactionStrategy`` (STCS) is to merge sstables
of approximately the same size. All
-sstables are put in different buckets depending on their size. An sstable is
added to the bucket if size of the sstable
-is within ``bucket_low`` and ``bucket_high`` of the current average size of
the sstables already in the bucket. This
-will create several buckets and the most interesting of those buckets will be
compacted. The most interesting one is
-decided by figuring out which bucket's sstables takes the most reads.
-
-Major compaction
-~~~~~~~~~~~~~~~~
-When running a major compaction with STCS you will end up with two sstables
per data directory (one for repaired data
-and one for unrepaired data). There is also an option (-s) to do a major
compaction that splits the output into several
-sstables. The sizes of the sstables are approximately 50%, 25%, 12.5%... of
the total size.
-.. _stcs-options:
-
-STCS options
-~~~~~~~~~~~~
-
-``min_sstable_size`` (default: 50MB)
- Sstables smaller than this are put in the same bucket.
-``bucket_low`` (default: 0.5)
- How much smaller than the average size of a bucket a sstable should be
before not being included in the bucket. That
- is, if ``bucket_low * avg_bucket_size < sstable_size`` (and the
``bucket_high`` condition holds, see below), then
- the sstable is added to the bucket.
-``bucket_high`` (default: 1.5)
- How much bigger than the average size of a bucket a sstable should be
before not being included in the bucket. That
- is, if ``sstable_size < bucket_high * avg_bucket_size`` (and the
``bucket_low`` condition holds, see above), then
- the sstable is added to the bucket.
-
-Defragmentation
-~~~~~~~~~~~~~~~
-
-Defragmentation is done when many sstables are touched during a read. The
result of the read is put in to the memtable
-so that the next read will not have to touch as many sstables. This can cause
writes on a read-only-cluster.
-
-.. _LCS:
-
-Leveled Compaction Strategy
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The idea of ``LeveledCompactionStrategy`` (LCS) is that all sstables are put
into different levels where we guarantee
-that no overlapping sstables are in the same level. By overlapping we mean
that the first/last token of a single sstable
-are never overlapping with other sstables. This means that for a SELECT we
will only have to look for the partition key
-in a single sstable per level. Each level is 10x the size of the previous one
and each sstable is 160MB by default. L0
-is where sstables are streamed/flushed - no overlap guarantees are given here.
-
-When picking compaction candidates we have to make sure that the compaction
does not create overlap in the target level.
-This is done by always including all overlapping sstables in the next level.
For example if we select an sstable in L3,
-we need to guarantee that we pick all overlapping sstables in L4 and make sure
that no currently ongoing compactions
-will create overlap if we start that compaction. We can start many parallel
compactions in a level if we guarantee that
-we wont create overlap. For L0 -> L1 compactions we almost always need to
include all L1 sstables since most L0 sstables
-cover the full range. We also can't compact all L0 sstables with all L1
sstables in a single compaction since that can
-use too much memory.
-
-When deciding which level to compact LCS checks the higher levels first (with
LCS, a "higher" level is one with a higher
-number, L0 being the lowest one) and if the level is behind a compaction will
be started in that level.
-
-Major compaction
-~~~~~~~~~~~~~~~~
-It is possible to do a major compaction with LCS - it will currently start by
filling out L1 and then once L1 is full,
-it continues with L2 etc. This is sub optimal and will change to create all
the sstables in a high level instead,
-CASSANDRA-11817.
-
-Bootstrapping
-~~~~~~~~~~~~~
-
-During bootstrap sstables are streamed from other nodes. The level of the
remote sstable is kept to avoid many
-compactions after the bootstrap is done. During bootstrap the new node also
takes writes while it is streaming the data
-from a remote node - these writes are flushed to L0 like all other writes and
to avoid those sstables blocking the
-remote sstables from going to the correct level, we only do STCS in L0 until
the bootstrap is done.
-
-STCS in L0
-~~~~~~~~~~
-
-If LCS gets very many L0 sstables reads are going to hit all (or most) of the
L0 sstables since they are likely to be
-overlapping. To more quickly remedy this LCS does STCS compactions in L0 if
there are more than 32 sstables there. This
-should improve read performance more quickly compared to letting LCS do its L0
-> L1 compactions. If you keep getting
-too many sstables in L0 it is likely that LCS is not the best fit for your
workload and STCS could work out better.
-
-Starved sstables
-~~~~~~~~~~~~~~~~
-
-If a node ends up with a leveling where there are a few very high level
sstables that are not getting compacted they
-might make it impossible for lower levels to drop tombstones etc. For example,
if there are sstables in L6 but there is
-only enough data to actually get a L4 on the node the left over sstables in L6
will get starved and not compacted. This
-can happen if a user changes sstable\_size\_in\_mb from 5MB to 160MB for
example. To avoid this LCS tries to include
-those starved high level sstables in other compactions if there has been 25
compaction rounds where the highest level
-has not been involved.
-
-.. _lcs-options:
-
-LCS options
-~~~~~~~~~~~
-
-``sstable_size_in_mb`` (default: 160MB)
- The target compressed (if using compression) sstable size - the sstables
can end up being larger if there are very
- large partitions on the node.
-
-``fanout_size`` (default: 10)
- The target size of levels increases by this fanout_size multiplier. You
can reduce the space amplification by tuning
- this option.
-
-LCS also support the ``cassandra.disable_stcs_in_l0`` startup option
(``-Dcassandra.disable_stcs_in_l0=true``) to avoid
-doing STCS in L0.
-
-.. _TWCS:
-
-Time Window CompactionStrategy
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-``TimeWindowCompactionStrategy`` (TWCS) is designed specifically for workloads
where it's beneficial to have data on
-disk grouped by the timestamp of the data, a common goal when the workload is
time-series in nature or when all data is
-written with a TTL. In an expiring/TTL workload, the contents of an entire
SSTable likely expire at approximately the
-same time, allowing them to be dropped completely, and space reclaimed much
more reliably than when using
-``SizeTieredCompactionStrategy`` or ``LeveledCompactionStrategy``. The basic
concept is that
-``TimeWindowCompactionStrategy`` will create 1 sstable per file for a given
window, where a window is simply calculated
-as the combination of two primary options:
-
-``compaction_window_unit`` (default: DAYS)
- A Java TimeUnit (MINUTES, HOURS, or DAYS).
-``compaction_window_size`` (default: 1)
- The number of units that make up a window.
-``unsafe_aggressive_sstable_expiration`` (default: false)
- Expired sstables will be dropped without checking its data is shadowing
other sstables. This is a potentially
- risky option that can lead to data loss or deleted data re-appearing,
going beyond what
- `unchecked_tombstone_compaction` does for single sstable compaction. Due
to the risk the jvm must also be
- started with `-Dcassandra.unsafe_aggressive_sstable_expiration=true`.
-
-Taken together, the operator can specify windows of virtually any size, and
`TimeWindowCompactionStrategy` will work to
-create a single sstable for writes within that window. For efficiency during
writing, the newest window will be
-compacted using `SizeTieredCompactionStrategy`.
-
-Ideally, operators should select a ``compaction_window_unit`` and
``compaction_window_size`` pair that produces
-approximately 20-30 windows - if writing with a 90 day TTL, for example, a 3
Day window would be a reasonable choice
-(``'compaction_window_unit':'DAYS','compaction_window_size':3``).
-
-TimeWindowCompactionStrategy Operational Concerns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The primary motivation for TWCS is to separate data on disk by timestamp and
to allow fully expired SSTables to drop
-more efficiently. One potential way this optimal behavior can be subverted is
if data is written to SSTables out of
-order, with new data and old data in the same SSTable. Out of order data can
appear in two ways:
-
-- If the user mixes old data and new data in the traditional write path, the
data will be comingled in the memtables
- and flushed into the same SSTable, where it will remain comingled.
-- If the user's read requests for old data cause read repairs that pull old
data into the current memtable, that data
- will be comingled and flushed into the same SSTable.
-
-While TWCS tries to minimize the impact of comingled data, users should
attempt to avoid this behavior. Specifically,
-users should avoid queries that explicitly set the timestamp via CQL ``USING
TIMESTAMP``. Additionally, users should run
-frequent repairs (which streams data in such a way that it does not become
comingled).
-
-Changing TimeWindowCompactionStrategy Options
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Operators wishing to enable ``TimeWindowCompactionStrategy`` on existing data
should consider running a major compaction
-first, placing all existing data into a single (old) window. Subsequent newer
writes will then create typical SSTables
-as expected.
-
-Operators wishing to change ``compaction_window_unit`` or
``compaction_window_size`` can do so, but may trigger
-additional compactions as adjacent windows are joined together. If the window
size is decrease d (for example, from 24
-hours to 12 hours), then the existing SSTables will not be modified - TWCS can
not split existing SSTables into multiple
-windows.
diff --git a/doc/source/operating/compaction/lcs.rst
b/doc/source/operating/compaction/lcs.rst
new file mode 100644
index 0000000..48c282e
--- /dev/null
+++ b/doc/source/operating/compaction/lcs.rst
@@ -0,0 +1,90 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+..
+.. http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+
+
+.. _LCS:
+
+Leveled Compaction Strategy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The idea of ``LeveledCompactionStrategy`` (LCS) is that all sstables are put
into different levels where we guarantee
+that no overlapping sstables are in the same level. By overlapping we mean
that the first/last token of a single sstable
+are never overlapping with other sstables. This means that for a SELECT we
will only have to look for the partition key
+in a single sstable per level. Each level is 10x the size of the previous one
and each sstable is 160MB by default. L0
+is where sstables are streamed/flushed - no overlap guarantees are given here.
+
+When picking compaction candidates we have to make sure that the compaction
does not create overlap in the target level.
+This is done by always including all overlapping sstables in the next level.
For example if we select an sstable in L3,
+we need to guarantee that we pick all overlapping sstables in L4 and make sure
that no currently ongoing compactions
+will create overlap if we start that compaction. We can start many parallel
compactions in a level if we guarantee that
+we won't create overlap. For L0 -> L1 compactions we almost always need to
include all L1 sstables since most L0 sstables
+cover the full range. We also can't compact all L0 sstables with all L1
sstables in a single compaction since that can
+use too much memory.
+
+When deciding which level to compact LCS checks the higher levels first (with
LCS, a "higher" level is one with a higher
+number, L0 being the lowest one) and if the level is behind a compaction will
be started in that level.
+
+Major compaction
+~~~~~~~~~~~~~~~~
+
+It is possible to do a major compaction with LCS - it will currently start by
filling out L1 and then once L1 is full,
+it continues with L2, etc. This is suboptimal and will change to create all
the sstables in a high level instead
+(CASSANDRA-11817).
+
+Bootstrapping
+~~~~~~~~~~~~~
+
+During bootstrap sstables are streamed from other nodes. The level of the
remote sstable is kept to avoid many
+compactions after the bootstrap is done. During bootstrap the new node also
takes writes while it is streaming the data
+from a remote node - these writes are flushed to L0 like all other writes and
to avoid those sstables blocking the
+remote sstables from going to the correct level, we only do STCS in L0 until
the bootstrap is done.
+
+STCS in L0
+~~~~~~~~~~
+
+If LCS gets very many L0 sstables reads are going to hit all (or most) of the
L0 sstables since they are likely to be
+overlapping. To more quickly remedy this LCS does STCS compactions in L0 if
there are more than 32 sstables there. This
+should improve read performance more quickly compared to letting LCS do its L0
-> L1 compactions. If you keep getting
+too many sstables in L0 it is likely that LCS is not the best fit for your
workload and STCS could work out better.
+
+Starved sstables
+~~~~~~~~~~~~~~~~
+
+If a node ends up with a leveling where there are a few very high level
sstables that are not getting compacted they
+might make it impossible for lower levels to drop tombstones etc. For example,
if there are sstables in L6 but there is
+only enough data to actually get an L4 on the node, the leftover sstables in L6
will get starved and not compacted. This
+can happen if a user changes ``sstable_size_in_mb`` from 5MB to 160MB, for
example. To avoid this, LCS tries to include
+those starved high level sstables in other compactions if there have been 25
compaction rounds where the highest level
+has not been involved.
+
+.. _lcs-options:
+
+LCS options
+~~~~~~~~~~~
+
+``sstable_size_in_mb`` (default: 160MB)
+ The target compressed (if using compression) sstable size - the sstables
can end up being larger if there are very
+ large partitions on the node.
+
+``fanout_size`` (default: 10)
+ The target size of levels increases by this fanout_size multiplier. You
can reduce the space amplification by tuning
+ this option.
+
+LCS also supports the ``cassandra.disable_stcs_in_l0`` startup option
(``-Dcassandra.disable_stcs_in_l0=true``) to avoid
+doing STCS in L0.
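+
+As an illustrative sketch (table name and values are examples, not tuning
+advice), LCS and its options are set per table::
+
+    ALTER TABLE mykeyspace.mytable
+    WITH compaction = {'class': 'LeveledCompactionStrategy',
+                       'sstable_size_in_mb': 160,
+                       'fanout_size': 10};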
+
+
diff --git a/doc/source/operating/compaction/stcs.rst
b/doc/source/operating/compaction/stcs.rst
new file mode 100644
index 0000000..6589337
--- /dev/null
+++ b/doc/source/operating/compaction/stcs.rst
@@ -0,0 +1,58 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+..
+.. http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+
+.. _STCS:
+
+Size Tiered Compaction Strategy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The basic idea of ``SizeTieredCompactionStrategy`` (STCS) is to merge sstables
of approximately the same size. All
+sstables are put in different buckets depending on their size. An sstable is
added to the bucket if the size of the sstable
+is within ``bucket_low`` and ``bucket_high`` of the current average size of
the sstables already in the bucket. This
+will create several buckets and the most interesting of those buckets will be
compacted. The most interesting one is
+decided by figuring out which bucket's sstables take the most reads.
+
+Major compaction
+~~~~~~~~~~~~~~~~
+
+When running a major compaction with STCS you will end up with two sstables
per data directory (one for repaired data
+and one for unrepaired data). There is also an option (-s) to do a major
compaction that splits the output into several
+sstables. The sizes of the sstables are approximately 50%, 25%, 12.5%... of
the total size.
+
+.. _stcs-options:
+
+STCS options
+~~~~~~~~~~~~
+
+``min_sstable_size`` (default: 50MB)
+ Sstables smaller than this are put in the same bucket.
+``bucket_low`` (default: 0.5)
+ How much smaller than the average size of a bucket an sstable should be
before not being included in the bucket. That
+ is, if ``bucket_low * avg_bucket_size < sstable_size`` (and the
``bucket_high`` condition holds, see below), then
+ the sstable is added to the bucket.
+``bucket_high`` (default: 1.5)
+ How much bigger than the average size of a bucket an sstable should be
before not being included in the bucket. That
+ is, if ``sstable_size < bucket_high * avg_bucket_size`` (and the
``bucket_low`` condition holds, see above), then
+ the sstable is added to the bucket.
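+
+A hedged example of enabling STCS with the bucket options spelled out (the
+values shown are simply the defaults, and the table name is hypothetical)::
+
+    ALTER TABLE mykeyspace.mytable
+    WITH compaction = {'class': 'SizeTieredCompactionStrategy',
+                       'bucket_low': 0.5,
+                       'bucket_high': 1.5};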
+
+Defragmentation
+~~~~~~~~~~~~~~~
+
+Defragmentation is done when many sstables are touched during a read. The
result of the read is put into the memtable
+so that the next read will not have to touch as many sstables. This can cause
writes on a read-only cluster.
+
+
diff --git a/doc/source/operating/compaction/twcs.rst
b/doc/source/operating/compaction/twcs.rst
new file mode 100644
index 0000000..3641a5a
--- /dev/null
+++ b/doc/source/operating/compaction/twcs.rst
@@ -0,0 +1,76 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+..
+.. http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+
+
+.. _TWCS:
+
+Time Window Compaction Strategy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``TimeWindowCompactionStrategy`` (TWCS) is designed specifically for workloads
where it's beneficial to have data on
+disk grouped by the timestamp of the data, a common goal when the workload is
time-series in nature or when all data is
+written with a TTL. In an expiring/TTL workload, the contents of an entire
SSTable likely expire at approximately the
+same time, allowing them to be dropped completely, and space reclaimed much
more reliably than when using
+``SizeTieredCompactionStrategy`` or ``LeveledCompactionStrategy``. The basic
concept is that
+``TimeWindowCompactionStrategy`` will create one sstable per window, where a
window is simply calculated
+as the combination of two primary options:
+
+``compaction_window_unit`` (default: DAYS)
+ A Java TimeUnit (MINUTES, HOURS, or DAYS).
+``compaction_window_size`` (default: 1)
+ The number of units that make up a window.
+``unsafe_aggressive_sstable_expiration`` (default: false)
+ Expired sstables will be dropped without checking whether their data shadows
data in other sstables. This is a potentially
+ risky option that can lead to data loss or deleted data re-appearing,
going beyond what
+ ``unchecked_tombstone_compaction`` does for single sstable compaction. Due
to the risk, the JVM must also be
+ started with ``-Dcassandra.unsafe_aggressive_sstable_expiration=true``.
+
+Taken together, the operator can specify windows of virtually any size, and
``TimeWindowCompactionStrategy`` will work to
+create a single sstable for writes within that window. For efficiency during
writing, the newest window will be
+compacted using ``SizeTieredCompactionStrategy``.
+
+Ideally, operators should select a ``compaction_window_unit`` and
``compaction_window_size`` pair that produces
+approximately 20-30 windows - if writing with a 90 day TTL, for example, a 3
day window would be a reasonable choice
+(``'compaction_window_unit':'DAYS','compaction_window_size':3``).
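+
+A sketch of that configuration in CQL (the table name is hypothetical)::
+
+    ALTER TABLE mykeyspace.mytable
+    WITH compaction = {'class': 'TimeWindowCompactionStrategy',
+                       'compaction_window_unit': 'DAYS',
+                       'compaction_window_size': 3};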
+
+TimeWindowCompactionStrategy Operational Concerns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The primary motivation for TWCS is to separate data on disk by timestamp and
to allow fully expired SSTables to drop
+more efficiently. One potential way this optimal behavior can be subverted is
if data is written to SSTables out of
+order, with new data and old data in the same SSTable. Out of order data can
appear in two ways:
+
+- If the user mixes old data and new data in the traditional write path, the
data will be comingled in the memtables
+ and flushed into the same SSTable, where it will remain comingled.
+- If the user's read requests for old data cause read repairs that pull old
data into the current memtable, that data
+ will be comingled and flushed into the same SSTable.
+
+While TWCS tries to minimize the impact of comingled data, users should
attempt to avoid this behavior. Specifically,
+users should avoid queries that explicitly set the timestamp via CQL ``USING
TIMESTAMP``. Additionally, users should run
+frequent repairs (which streams data in such a way that it does not become
comingled).
+
+Changing TimeWindowCompactionStrategy Options
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Operators wishing to enable ``TimeWindowCompactionStrategy`` on existing data
should consider running a major compaction
+first, placing all existing data into a single (old) window. Subsequent newer
writes will then create typical SSTables
+as expected.
+
+Operators wishing to change ``compaction_window_unit`` or
``compaction_window_size`` can do so, but may trigger
+additional compactions as adjacent windows are joined together. If the window
size is decreased (for example, from 24
+hours to 12 hours), then the existing SSTables will not be modified - TWCS
cannot split existing SSTables into multiple
+windows.
+
diff --git a/doc/source/operating/hardware.rst
b/doc/source/operating/hardware.rst
index ad3aa8d..d90550c 100644
--- a/doc/source/operating/hardware.rst
+++ b/doc/source/operating/hardware.rst
@@ -77,8 +77,6 @@ Many large users of Cassandra run in various clouds,
including AWS, Azure, and G
of these environments. Users should choose similar hardware to what would be
needed in physical space. In EC2, popular
options include:
-- m1.xlarge instances, which provide 1.6TB of local ephemeral spinning storage
and sufficient RAM to run moderate
- workloads
- i2 instances, which provide both a high RAM:CPU ratio and local ephemeral
SSDs
- m4.2xlarge / c4.4xlarge instances, which provide modern CPUs, enhanced
networking and work well with EBS GP2 (SSD)
storage
diff --git a/doc/source/operating/index.rst b/doc/source/operating/index.rst
index e2cead2..78c7eb6 100644
--- a/doc/source/operating/index.rst
+++ b/doc/source/operating/index.rst
@@ -27,7 +27,7 @@ Operating Cassandra
repair
read_repair
hints
- compaction
+ compaction/index
bloom_filters
compression
cdc
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]