[
https://issues.apache.org/jira/browse/CASSANDRA-9644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615062#comment-14615062
]
Jeff Jirsa commented on CASSANDRA-9644:
---------------------------------------
{quote}
Should also note that using incremental repair makes sure that we don't stream
in a bunch of data in the old windows
{quote}
No such guarantee exists; you'll still stream in a bunch of data into old
windows during bootstrap and decom.
> DTCS configuration proposals for handling consequences of repairs
> -----------------------------------------------------------------
>
> Key: CASSANDRA-9644
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9644
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Antti Nissinen
> Fix For: 3.x, 2.1.x
>
> Attachments: node0_20150621_1646_time_graph.txt,
> node0_20150621_2320_time_graph.txt, node0_20150623_1526_time_graph.txt,
> node1_20150621_1646_time_graph.txt, node1_20150621_2320_time_graph.txt,
> node1_20150623_1526_time_graph.txt, node2_20150621_1646_time_graph.txt,
> node2_20150621_2320_time_graph.txt, node2_20150623_1526_time_graph.txt,
> nodetool status infos.txt, sstable_compaction_trace.txt,
> sstable_compaction_trace_snipped.txt, sstable_counts.jpg
>
>
> This is a document bringing up some issues that arise when DTCS is used to
> compact time series data in a three node cluster. DTCS is currently configured
> with a few parameters that keep the configuration fairly simple, but that might
> cause problems in certain special cases, such as recovering from the flood of
> small SSTables produced by a repair operation. We are suggesting some ideas that
> might be a starting point for further discussion. The following sections cover:
> - Description of the Cassandra setup
> - Feeding process of the data
> - Failure testing
> - Issues caused by repair operations for DTCS
> - Proposal for the DTCS configuration parameters
> Attachments are included to support the discussion and there is a separate
> section explaining them.
> Cassandra setup and data model
> - The cluster is composed of three nodes running Cassandra 2.1.2. The
> replication factor is two and the read and write consistency levels are ONE.
> - Data is time series data. Data is saved so that one row contains a certain
> time span of data for a given metric (20 days in this case). The row key
> contains information about the start time of the time span and the metric name.
> The column name gives the offset from the beginning of the time span. The
> column timestamp is set to the sum of the timestamp from the row key and the
> offset (i.e. the actual timestamp of the data point). The data model is
> analogous to the KairosDB implementation (a CQL sketch of such a schema is
> given after this list).
> - The average sampling interval is 10 seconds, varying significantly from
> metric to metric.
> - 100 000 metrics are fed into Cassandra.
> - max_sstable_age_days is set to 5 days (the objective is to keep SSTable files
> at a manageable size, around 50 GB).
> - TTL is not in use in the test.
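> As a rough illustration of the data model and compaction settings above, a
> minimal CQL sketch could look like the following. The keyspace, table and
> column names, and the base_time_seconds value, are illustrative only and not
> taken from the test setup; max_sstable_age_days is set to 5 days as described
> above:
> {code}
> CREATE TABLE metrics.data_points (
>     row_key  text,    -- metric name plus the start time of the 20-day span
>     offset   bigint,  -- offset of the data point from the start of the span
>     value    blob,
>     PRIMARY KEY (row_key, offset)
> ) WITH compaction = {
>     'class': 'DateTieredCompactionStrategy',
>     'base_time_seconds': '3600',   -- initial window size (illustrative value)
>     'max_sstable_age_days': '5'    -- keeps SSTable files at a manageable size
> };
> {code}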
> Procedure for the failure test:
> - Data is first dumped to Cassandra for 11 days and the dumping is then stopped
> so that DTCS has a chance to finish all compactions. Data is dumped with "fake
> timestamps", i.e. the column timestamp is set explicitly when the data is
> written to Cassandra (see the write example after this list).
> - One of the nodes is taken down and new data is dumped on top of the earlier
> data, covering a couple of hours' worth of data (faked timestamps).
> - Dumping is stopped and the node is kept down for a few hours.
> - The node is brought back up and "nodetool repair" is run on the node that was
> down.
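> The "fake timestamps" mentioned above can be set explicitly per write with
> CQL's USING TIMESTAMP clause (the value is in microseconds since the epoch).
> A minimal sketch against the illustrative schema above, with made-up values:
> {code}
> -- Write a data point whose write timestamp equals the sample time
> -- (2015-06-10 00:00:00 UTC) instead of the wall-clock insert time.
> INSERT INTO metrics.data_points (row_key, offset, value)
> VALUES ('metric_00042:20150601', 777600000, 0x0102)
> USING TIMESTAMP 1433894400000000;
> {code}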
> Consequences
> - The repair operation leads to a massive amount of new SSTables far back in
> history. The new SSTables cover time spans similar to the files that were
> created by DTCS before the shutdown of one of the nodes.
> - To be able to compact the small files, max_sstable_age_days has to be
> increased so that compaction can handle them. However, in a practical case the
> time window grows so large that the generated files become huge, which is not
> desirable. Compaction also combines one very large file with a bunch of small
> files in several phases, which is not efficient. Generating really large files
> may also lead to out-of-disk-space problems.
> - See the list of time graphs later in the document.
> Improvement proposals for the DTCS configuration
> Below is a list of desired properties for the configuration. Current
> parameters are mentioned where available.
> - Initial window size (currently: base_time_seconds)
> - The number of similar-size windows used for bucketing (currently:
> min_threshold)
> - The multiplier for the window size when it is increased (currently:
> min_threshold). We would like this to be independent of the min_threshold
> parameter so that the rate at which the window size grows can actually be
> controlled.
> - Maximum length of the time window inside which files are assigned to a
> certain bucket (not currently defined). This restricts the expansion of the
> time window length: when the limit is reached, the window size remains the
> same all the way back in history (e.g. one week).
> - The maximum horizon within which SSTables are candidates for buckets
> (currently: max_sstable_age_days)
> - Maximum SSTable file size allowed in a set of files to be compacted (not
> currently possible), preventing out-of-disk-space situations.
> - Optional strategies to select the most interesting bucket:
>   - Minimum number of SSTables in the time window before it is a candidate
> for the most interesting bucket (currently: min_threshold for the most recent
> window, otherwise two). Being able to set this value independently would allow
> most of the effort to be put on those areas where a large number of small files
> should be compacted together, instead of a few new files.
>   - Optionally, the criterion for the most interesting bucket could be set:
> e.g. select the window with the most files to be compacted.
>   - Inside the bucket, when the number of files is limited by max_threshold,
> compaction would select the small files first instead of one huge file and a
> bunch of small files.
> The above set of parameters would allow recovery from repair operations that
> produce a large amount of small SSTables (a hypothetical configuration sketch
> is given after this list).
> - A maximum length of the time window for compactions would keep the compacted
> SSTable size in a reasonable range and would allow the horizon to be extended
> far back in history.
> - Combining small files together, instead of combining one huge file with e.g.
> 31 small files again and again, is more disk efficient.
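> To make the proposal concrete, a hypothetical compaction configuration might
> look like the sketch below. Only base_time_seconds, min_threshold,
> max_threshold and max_sstable_age_days exist in the current DTCS;
> window_size_multiplier, max_window_size_days and max_compaction_file_size_mb
> are invented names for the proposed parameters and are not implemented:
> {code}
> ALTER TABLE metrics.data_points WITH compaction = {
>     'class': 'DateTieredCompactionStrategy',
>     'base_time_seconds': '3600',            -- initial window size (existing)
>     'min_threshold': '4',                   -- windows per bucketing step (existing)
>     'max_threshold': '32',                  -- max SSTables per compaction (existing)
>     'max_sstable_age_days': '30',           -- horizon for compaction candidates (existing)
>     'window_size_multiplier': '4',          -- proposed: growth rate, decoupled from min_threshold
>     'max_window_size_days': '7',            -- proposed: cap on the window length
>     'max_compaction_file_size_mb': '51200'  -- proposed: skip files larger than ~50 GB
> };
> {code}
> With a cap on the window length, max_sstable_age_days could be raised (here to
> 30 days) without the windows, and hence the compacted files, growing without
> bound.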
> In addition to the previous advantages, the above parameters would also allow:
> - Dumping more data into the history (e.g. new metrics) by assigning the
> correct timestamp to the column (fake timestamp), with proper compaction of
> new and existing SSTables.
> - Expiring reasonably sized SSTables with TTL even if compactions are executed
> only intermittently far back in history. In this case the new data has to be
> fed with a dynamically calculated TTL (see the example after this list).
> - Note: being able to give an absolute timestamp for the column expiry time
> would be beneficial when data is dumped back into history. This is the case
> when you move data from some legacy system to Cassandra with faked timestamps
> and would like to keep the data only for a certain time period. Currently the
> absolute expiry timestamp is calculated by Cassandra from the system time and
> the given TTL, so the TTL has to be calculated dynamically from the current
> time and the desired expiry moment, which makes things more complex.
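> Since CQL only accepts a relative TTL in seconds, expiring data at an absolute
> moment currently requires the client to compute the TTL from the desired
> expiry time and the current wall-clock time. A small worked example with
> made-up values, against the illustrative schema above:
> {code}
> -- Desired expiry:     2015-07-01 00:00:00 UTC = 1435708800 (epoch seconds)
> -- Current wall clock: 2015-06-23 00:00:00 UTC = 1435017600
> -- TTL to use:         1435708800 - 1435017600 = 691200 seconds (8 days)
> INSERT INTO metrics.data_points (row_key, offset, value)
> VALUES ('metric_00042:20150601', 777600000, 0x0102)
> USING TIMESTAMP 1433894400000000 AND TTL 691200;
> {code}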
> One interesting question is why those duplicate SSTable files are created.
> The duplication problem could not be reproduced when the data was dumped with
> the following spec:
> - 100 metrics
> - 20 days of data in one row
> - one year of data
> - max_sstable_age_days = 15
> - memtable_offheap_space_in_mb was decreased so that small SSTables were
> created (to create something to be compacted)
> - One node was taken down and one more day of data was dumped on top of the
> earlier data
> - "nodetool repair -pr" was executed on each node => duplicates were checked
> in each step => no duplicates
> - "nodetool repair" was executed on a node that was down => no duplicates
> were generated
> --------------------------------------------------------------------------------------------------------------------
> ATTACHMENTS
> Time graphs of the contents of SSTables from different phases of the test run:
> *************************************************************
> The fields in the time graphs below are the following:
> - Order number from the SSTable file name
> - Minimum column timestamp in the SSTable file
> - Graphical representation of the timespan
> - Maximum column timestamp in the SSTable
> - The size of the SSTable in megabytes
> Time graphs after dumping the 11 days of data and letting all compactions
> run through:
> node0_20150621_1646_time_graph.txt
> node1_20150621_1646_time_graph.txt
> node2_20150621_1646_time_graph.txt (error: same file as for node1, but the
> behavior is the same)
> Time graphs after taking one node down (node2) and dumping a couple of hours
> of more data:
> node0_20150621_2320_time_graph.txt
> node1_20150621_2320_time_graph.txt
> node2_20150621_2320_time_graph.txt
> Time graphs when the repair operation has finished and the compactions are
> done. Compactions naturally handle only the files inside the
> max_sstable_age_days range.
> ==> Now there is a large amount of small files covering pretty much the same
> areas as the original SSTables.
> node0_20150623_1526_time_graph.txt
> node1_20150623_1526_time_graph.txt
> node2_20150623_1526_time_graph.txt
> -----------
> Trend of the SSTable count as a function of time on each node.
> sstable_counts.jpg
> Vertical lines:
> 1) Clearing the database and dumping the 11 days worth of data
> 2) Stopping the dumping and letting compactions run
> 3) Taking one node down (the bottom one in the figure) and dumping a few hours
> of new data on top of the earlier data
> 4) Starting the repair operation
> 5) Repair operation finished
> --------
> Nodetool status output before and after the repair operation:
> nodetool status infos.txt
> --------------
> Tracing compactions
> Log files were parsed to demonstrate the creation of new small SSTables and
> the combination of one large file with a bunch of small ones. This is done
> for the time range that max_sstable_age_days is able to reach (in this
> case 5 days). The hierarchy of the files is shown in the file
> "sstable_compaction_trace.txt". The first flushed file can be found on
> line 10.
> Each line represents either a flushed SSTable or an SSTable created by DTCS.
> For flushed files the timestamp indicates the time period the file
> represents. For compacted files (marked with C) the first timestamp
> represents the moment when the compaction was done (wall clock). Timestamps
> are faked when written to the database. The size of the file is the last
> field. The first field, with a number in parentheses, shows the level of the
> file. Top-level files marked with (0) are those that don't have any
> predecessors and should also be found on disk.
> SSTables that are created by the repair operation are not mentioned in the
> log files, so they are handled as phantom files. The existence of such a file
> can be concluded from the predecessor list of a compacted SSTable. Those are
> marked with None,None in the timestamps.
> The file "sstable_compaction_trace_snipped.txt" contains one portion that
> shows the compaction hierarchy for the small files originating from the repair
> operation. max_threshold is at its default value of 32. In each step, 31 tiny
> files are compacted together with a 46 GB file.
> sstable_compaction_trace.txt
> sstable_compaction_trace_snipped.txt
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)