Add Change Data Capture documentation
Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/51b939c9
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/51b939c9
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/51b939c9

Branch: refs/heads/trunk
Commit: 51b939c91db5d1a7664d76c8f57160f2570ee1dd
Parents: 7bf837c
Author: Josh McKenzie <[email protected]>
Authored: Mon Jun 20 13:38:00 2016 -0400
Committer: Sylvain Lebresne <[email protected]>
Committed: Tue Jun 21 14:12:59 2016 +0200

----------------------------------------------------------------------
 doc/source/operations.rst | 75 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 73 insertions(+), 2 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/cassandra/blob/51b939c9/doc/source/operations.rst
----------------------------------------------------------------------
diff --git a/doc/source/operations.rst b/doc/source/operations.rst
index 9094766..d7fcafb 100644
--- a/doc/source/operations.rst
+++ b/doc/source/operations.rst
@@ -338,7 +338,6 @@ There is a number of common options for all the compaction strategies;
 ``enabled`` (default: true)
     Whether minor compactions should run. Note that you can have 'enabled': true as a compaction option and then do
     'nodetool enableautocompaction' to start running compactions.
-    Default true.
 ``tombstone_threshold`` (default: 0.2)
     How much of the sstable should be tombstones for us to consider doing a single sstable compaction of that sstable.
 ``tombstone_compaction_interval`` (default: 86400s (1 day))
@@ -738,7 +737,7 @@ similar text columns (such as repeated JSON blobs) often compress very well.
 Operational Impact
 ^^^^^^^^^^^^^^^^^^

-- Compression metadata is stored offheap and scales with data on disk. This often requires 1-3GB of offheap RAM per
+- Compression metadata is stored off-heap and scales with data on disk. This often requires 1-3GB of off-heap RAM per
   terabyte of data on disk, though the exact usage varies with ``chunk_length_in_kb`` and compression ratios.

 - Streaming operations involve compressing and decompressing data on compressed tables - in some code paths (such as
@@ -754,6 +753,78 @@ Advanced Use
 Advanced users can provide their own compression class by implementing the interface at
 ``org.apache.cassandra.io.compress.ICompressor``.

+Change Data Capture
+-------------------
+
+Overview
+^^^^^^^^
+
+Change data capture (CDC) provides a mechanism to flag specific tables for archival as well as rejecting writes to those
+tables once a configurable size-on-disk for the combined flushed and unflushed CDC-log is reached. An operator can
+enable CDC on a table by setting the table property ``cdc=true`` (either when :ref:`creating the table
+<create-table-statement>` or :ref:`altering it <alter-table-statement>`), after which any CommitLogSegments containing
+data for a CDC-enabled table are moved to the directory specified in ``cassandra.yaml`` on segment discard. A threshold
+of total disk space allowed is specified in the yaml at which time newly allocated CommitLogSegments will not allow CDC
+data until a consumer parses and removes data from the destination archival directory.
+
+Configuration
+^^^^^^^^^^^^^
+
+Enabling or disable CDC on a table
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+CDC is enable or disable through the `cdc` table property, for instance::
+
+    CREATE TABLE foo (a int, b text, PRIMARY KEY(a)) WITH cdc=true;
+
+    ALTER TABLE foo WITH cdc=true;
+
+    ALTER TABLE foo WITH cdc=false;
+
+cassandra.yaml parameters
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following `cassandra.yaml` are available for CDC:
+
+``cdc_enabled`` (default: false)
+    Enable or disable CDC operations node-wide.
+``cdc_raw_directory`` (default: ``$CASSANDRA_HOME/data/cdc_raw``)
+    Destination for CommitLogSegments to be moved after all corresponding memtables are flushed.
+``cdc_free_space_in_mb``: (default: min of 4096 and 1/8th volume space)
+    Calculated as sum of all active CommitLogSegments that permit CDC + all flushed CDC segments in
+    ``cdc_raw_directory``.
+``cdc_free_space_check_interval_ms`` (default: 250)
+    When at capacity, we limit the frequency with which we re-calculate the space taken up by ``cdc_raw_directory`` to
+    prevent burning CPU cycles unnecessarily. Default is to check 4 times per second.
+
+.. _reading-commitlogsegments:
+
+Reading CommitLogSegments
+^^^^^^^^^^^^^^^^^^^^^^^^^
+This implementation included a refactor of CommitLogReplayer into `CommitLogReader.java
+<https://github.com/apache/cassandra/blob/e31e216234c6b57a531cae607e0355666007deb2/src/java/org/apache/cassandra/db/commitlog/CommitLogReader.java>`__.
+Usage is `fairly straightforward
+<https://github.com/apache/cassandra/blob/e31e216234c6b57a531cae607e0355666007deb2/src/java/org/apache/cassandra/db/commitlog/CommitLogReplayer.java#L132-L140>`__
+with a `variety of signatures
+<https://github.com/apache/cassandra/blob/e31e216234c6b57a531cae607e0355666007deb2/src/java/org/apache/cassandra/db/commitlog/CommitLogReader.java#L71-L103>`__
+available for use. In order to handle mutations read from disk, implement `CommitLogReadHandler
+<https://github.com/apache/cassandra/blob/e31e216234c6b57a531cae607e0355666007deb2/src/java/org/apache/cassandra/db/commitlog/CommitLogReadHandler.java>`__.
+
+Warnings
+^^^^^^^^
+
+**Do not enable CDC without some kind of consumption process in-place.**
+
+The initial implementation of Change Data Capture does not include a parser (see :ref:`reading-commitlogsegments` above)
+so, if CDC is enabled on a node and then on a table, the ``cdc_free_space_in_mb`` will fill up and then writes to
+CDC-enabled tables will be rejected unless some consumption process is in place.
+
+Further Reading
+^^^^^^^^^^^^^^^
+
+- `Design doc <https://docs.google.com/document/d/1ZxCWYkeZTquxsvf5hdPc0fiUnUHna8POvgt6TIzML4Y/edit>`__
+- `JIRA ticket <https://issues.apache.org/jira/browse/CASSANDRA-8844>`__
+
 Backups
 -------
