[
https://issues.apache.org/jira/browse/CASSANDRA-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615062#comment-14615062
]
Jeff Jirsa commented on CASSANDRA-9644:
---------------------------------------
{quote}
Should also note that using incremental repair makes sure that we don't stream
in a bunch of data in the old windows
{quote}
No such guarantee exists; you'll still stream in a bunch of data into old
windows during bootstrap and decom.
> DTCS configuration proposals for handling consequences of repairs
> -----------------------------------------------------------------
>
> Key: CASSANDRA-9644
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9644
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Antti Nissinen
> Fix For: 3.x, 2.1.x
>
> Attachments: node0_20150621_1646_time_graph.txt,
> node0_20150621_2320_time_graph.txt, node0_20150623_1526_time_graph.txt,
> node1_20150621_1646_time_graph.txt, node1_20150621_2320_time_graph.txt,
> node1_20150623_1526_time_graph.txt, node2_20150621_1646_time_graph.txt,
> node2_20150621_2320_time_graph.txt, node2_20150623_1526_time_graph.txt,
> nodetool status infos.txt, sstable_compaction_trace.txt,
> sstable_compaction_trace_snipped.txt, sstable_counts.jpg
>
>
> This is a document bringing up some issues that arise when DTCS is used to
> compact time series data in a three node cluster. DTCS is currently configured
> with a few parameters that keep the configuration fairly simple, but that might
> cause problems in certain special cases, such as recovering from the flood of
> small SSTables produced by a repair operation. We are suggesting some ideas that
> might be a starting point for further discussion. The following sections cover:
> - Description of the Cassandra setup
> - Feeding process of the data
> - Failure testing
> - Issues caused by repair operations for DTCS
> - Proposal for the DTCS configuration parameters
> Attachments are included to support the discussion and there is a separate
> section explaining them.
> Cassandra setup and data model
> - The cluster is composed of three nodes running Cassandra 2.1.2. The
> replication factor is two and the read and write consistency levels are ONE.
> - Data is time series data. Data is saved so that one row contains a certain
> time span of data for a given metric (20 days in this case). The row key
> contains information about the start time of the time span and the metric name.
> The column name gives the offset from the beginning of the time span. The
> column timestamp is set to the sum of the timestamp from the row key and the
> offset (i.e. the actual timestamp of the data point). The data model is
> analogous to the KairosDB implementation (a CQL sketch of such a schema is
> given after this list).
> - The average sampling interval is 10 seconds, varying significantly from
> metric to metric.
> - 100 000 metrics are fed into Cassandra.
> - max_sstable_age_days is set to 5 days (the objective is to keep SSTable files
> at a manageable size, around 50 GB).
> - TTL is not in use in the test.
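> As a rough illustration of the data model and compaction settings above, a
> minimal CQL sketch could look like the following. The keyspace, table and
> column names, and the base_time_seconds value, are illustrative only and not
> taken from the test setup; max_sstable_age_days is set to 5 days as described
> above:
> {code}
> CREATE TABLE metrics.data_points (
>     row_key  text,    -- metric name plus the start time of the 20-day span
>     offset   bigint,  -- offset of the data point from the start of the span
>     value    blob,
>     PRIMARY KEY (row_key, offset)
> ) WITH compaction = {
>     'class': 'DateTieredCompactionStrategy',
>     'base_time_seconds': '3600',   -- initial window size (illustrative value)
>     'max_sstable_age_days': '5'    -- keeps SSTable files at a manageable size
> };
> {code}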
> Procedure for the failure test:
> - Data is first dumped to Cassandra for 11 days and the dumping is then stopped
> so that DTCS has a chance to finish all compactions. Data is dumped with "fake
> timestamps", i.e. the column timestamp is set explicitly when the data is
> written to Cassandra (see the write example after this list).
> - One of the nodes is taken down and new data is dumped on top of the earlier
> data, covering a couple of hours' worth of data (faked timestamps).
> - Dumping is stopped and the node is kept down for a few hours.
> - The node is brought back up and "nodetool repair" is run on the node that was
> down.
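> The "fake timestamps" mentioned above can be set explicitly per write with
> CQL's USING TIMESTAMP clause (the value is in microseconds since the epoch).
> A minimal sketch against the illustrative schema above, with made-up values:
> {code}
> -- Write a data point whose write timestamp equals the sample time
> -- (2015-06-10 00:00:00 UTC) instead of the wall-clock insert time.
> INSERT INTO metrics.data_points (row_key, offset, value)
> VALUES ('metric_00042:20150601', 777600000, 0x0102)
> USING TIMESTAMP 1433894400000000;
> {code}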
> Consequences
> - The repair operation leads to a massive amount of new SSTables far back in
> history. The new SSTables cover time spans similar to the files that were
> created by DTCS before the shutdown of one of the nodes.
> - To be able to compact the small files, max_sstable_age_days has to be
> increased so that compaction can handle them. However, in a practical case the
> time window grows so large that the generated files become huge, which is not
> desirable. Compaction also combines one very large file with a bunch of small
> files in several phases, which is not efficient. Generating really large files
> may also lead to out-of-disk-space problems.
> - See the list of time graphs later in the document.
> Improvement proposals for the DTCS configuration
> Below is a list of desired properties for the configuration. Current
> parameters are mentioned where available.
> - Initial window size (currently: base_time_seconds)
> - The number of similar-size windows used for bucketing (currently:
> min_threshold)
> - The multiplier for the window size when it is increased (currently:
> min_threshold). We would like this to be independent of the min_threshold
> parameter so that the rate at which the window size grows can actually be
> controlled.
> - Maximum length of the time window inside which files are assigned to a
> certain bucket (not currently defined). This restricts the expansion of the
> time window length: when the limit is reached, the window size remains the
> same all the way back in history (e.g. one week).
> - The maximum horizon within which SSTables are candidates for buckets
> (currently: max_sstable_age_days)
> - Maximum SSTable file size allowed in a set of files to be compacted (not
> currently possible), preventing out-of-disk-space situations.
> - Optional strategies to select the most interesting bucket:
>   - Minimum number of SSTables in the time window before it is a candidate
> for the most interesting bucket (currently: min_threshold for the most recent
> window, otherwise two). Being able to set this value independently would allow
> most of the effort to be put on those areas where a large number of small files
> should be compacted together, instead of a few new files.
>   - Optionally, the criterion for the most interesting bucket could be set:
> e.g. select the window with the most files to be compacted.
>   - Inside the bucket, when the number of files is limited by max_threshold,
> compaction would select the small files first instead of one huge file and a
> bunch of small files.
> The above set of parameters would allow recovery from repair operations that
> produce a large amount of small SSTables (a hypothetical configuration sketch
> is given after this list).
> - A maximum length of the time window for compactions would keep the compacted
> SSTable size in a reasonable range and would allow the horizon to be extended
> far back in history.
> - Combining small files together, instead of combining one huge file with e.g.
> 31 small files again and again, is more disk efficient.
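> To make the proposal concrete, a hypothetical compaction configuration might
> look like the sketch below. Only base_time_seconds, min_threshold,
> max_threshold and max_sstable_age_days exist in the current DTCS;
> window_size_multiplier, max_window_size_days and max_compaction_file_size_mb
> are invented names for the proposed parameters and are not implemented:
> {code}
> ALTER TABLE metrics.data_points WITH compaction = {
>     'class': 'DateTieredCompactionStrategy',
>     'base_time_seconds': '3600',            -- initial window size (existing)
>     'min_threshold': '4',                   -- windows per bucketing step (existing)
>     'max_threshold': '32',                  -- max SSTables per compaction (existing)
>     'max_sstable_age_days': '30',           -- horizon for compaction candidates (existing)
>     'window_size_multiplier': '4',          -- proposed: growth rate, decoupled from min_threshold
>     'max_window_size_days': '7',            -- proposed: cap on the window length
>     'max_compaction_file_size_mb': '51200'  -- proposed: skip files larger than ~50 GB
> };
> {code}
> With a cap on the window length, max_sstable_age_days could be raised (here to
> 30 days) without the windows, and hence the compacted files, growing without
> bound.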
> In addition to the previous advantages, the above parameters would also allow:
> - Dumping more data into the history (e.g. new metrics) by assigning the
> correct timestamp to the column (fake timestamp), with proper compaction of
> new and existing SSTables.
> - Expiring reasonably sized SSTables with TTL even if compactions are executed
> only intermittently far back in history. In this case the new data has to be
> fed with a dynamically calculated TTL (see the example after this list).
> - Note: being able to give an absolute timestamp for the column expiry time
> would be beneficial when data is dumped back into history. This is the case
> when you move data from some legacy system to Cassandra with faked timestamps
> and would like to keep the data only for a certain time period. Currently the
> absolute expiry timestamp is calculated by Cassandra from the system time and
> the given TTL, so the TTL has to be calculated dynamically from the current
> time and the desired expiry moment, which makes things more complex.
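> Since CQL only accepts a relative TTL in seconds, expiring data at an absolute
> moment currently requires the client to compute the TTL from the desired
> expiry time and the current wall-clock time. A small worked example with
> made-up values, against the illustrative schema above:
> {code}
> -- Desired expiry:     2015-07-01 00:00:00 UTC = 1435708800 (epoch seconds)
> -- Current wall clock: 2015-06-23 00:00:00 UTC = 1435017600
> -- TTL to use:         1435708800 - 1435017600 = 691200 seconds (8 days)
> INSERT INTO metrics.data_points (row_key, offset, value)
> VALUES ('metric_00042:20150601', 777600000, 0x0102)
> USING TIMESTAMP 1433894400000000 AND TTL 691200;
> {code}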
> One interesting question is why those duplicate SSTable files are created.
> The duplication problem could not be reproduced when the data was dumped with
> the following spec:
> - 100 metrics
> - 20 days of data in one row
> - one year of data
> - max_sstable_age_days = 15
> - memtable_offheap_space_in_mb was decreased so that small SSTables were
> created (to create something to be compacted)
> - One node was taken down and one more day of data was dumped on top of the
> earlier data
> - "nodetool repair -pr" was executed on each node => duplicates were checked
> in each step => no duplicates
> - "nodetool repair" was executed on a node that was down => no duplicates
> were generated
> --------------------------------------------------------------------------------------------------------------------
> ATTACHMENTS
> Time graphs of the contents of SSTables from different phases of the test run:
> *************************************************************
> The fields in the time graphs below are the following:
> - Order number from the SSTable file name
> - Minimum column timestamp in the SSTable file
> - Graphical representation of the timespan
> - Maximum column timestamp in the SSTable
> - The size of the SSTable in megabytes
> Time graphs after dumping the 11 days of data and letting all compactions
> run through:
> node0_20150621_1646_time_graph.txt
> node1_20150621_1646_time_graph.txt
> node2_20150621_1646_time_graph.txt (error: same file as for node1, but the
> behavior is the same)
> Time graphs after taking one node down (node2) and dumping a couple of hours
> of more data:
> node0_20150621_2320_time_graph.txt
> node1_20150621_2320_time_graph.txt
> node2_20150621_2320_time_graph.txt
> Time graphs when the repair operation has finished and the compactions are
> done. Compactions naturally handle only the files inside the
> max_sstable_age_days range.
> ==> Now there is a large amount of small files covering pretty much the same
> areas as the original SSTables.
> node0_20150623_1526_time_graph.txt
> node1_20150623_1526_time_graph.txt
> node2_20150623_1526_time_graph.txt
> -----------
> Trend of the SSTable count as a function of time on each node.
> sstable_counts.jpg
> Vertical lines:
> 1) Clearing the database and dumping the 11 days worth of data
> 2) Stopping the dumping and letting compactions run
> 3) Taking one node down (the bottom one in the figure) and dumping a few hours
> of new data on top of the earlier data
> 4) Starting the repair operation
> 5) Repair operation finished
> --------
> Nodetool status output before and after the repair operation:
> nodetool status infos.txt
> --------------
> Tracing compactions
> Log files were parsed to demonstrate the creation of new small SSTables and
> the combination of one large file with a bunch of small ones. This is done
> for the time range that max_sstable_age_days is able to reach (in this
> case 5 days). The hierarchy of the files is shown in the file
> "sstable_compaction_trace.txt". The first flushed file can be found on
> line 10.
> Each line represents either a flushed SSTable or an SSTable created by DTCS.
> For flushed files the timestamp indicates the time period the file
> represents. For compacted files (marked with C) the first timestamp
> represents the moment when the compaction was done (wall clock). Timestamps
> are faked when written to the database. The size of the file is the last
> field. The first field, with a number in parentheses, shows the level of the
> file. Top-level files marked with (0) are those that don't have any
> predecessors and should also be found on disk.
> SSTables that are created by the repair operation are not mentioned in the
> log files, so they are handled as phantom files. The existence of such a file
> can be concluded from the predecessor list of a compacted SSTable. Those are
> marked with None,None in the timestamps.
> The file "sstable_compaction_trace_snipped.txt" contains one portion that
> shows the compaction hierarchy for the small files originating from the repair
> operation. max_threshold is at its default value of 32. In each step, 31 tiny
> files are compacted together with a 46 GB file.
> sstable_compaction_trace.txt
> sstable_compaction_trace_snipped.txt
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)