Antti Nissinen created CASSANDRA-10195:
------------------------------------------

             Summary: TWCS experiments and improvement proposals
                 Key: CASSANDRA-10195
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10195
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Antti Nissinen
             Fix For: 2.1.x, 2.2.x
         Attachments: 20150814_1027_compaction_hierarchy.txt, 
node0_20150727_1250_time_graph.txt, node0_20150810_1017_time_graph.txt, 
node0_20150812_1531_time_graph.txt, node0_20150813_0835_time_graph.txt, 
node0_20150814_1054_time_graph.txt, node1_20150727_1250_time_graph.txt, 
node1_20150810_1017_time_graph.txt, node1_20150812_1531_time_graph.txt, 
node1_20150813_0835_time_graph.txt, node1_20150814_1054_time_graph.txt, 
node2_20150727_1250_time_graph.txt, node2_20150810_1017_time_graph.txt, 
node2_20150812_1531_time_graph.txt, node2_20150813_0835_time_graph.txt, 
node2_20150814_1054_time_graph.txt, sstable_count_figure1.png, 
sstable_count_figure2.png

This JIRA item describes experiments with DateTieredCompactionStrategy (DTCS) 
and TimeWindowCompactionStrategy (TWCS) and proposes modifications to TWCS. 
In a test system several crashes were caused intentionally (and 
unintentionally) and repair operations were executed, leading to a flood of 
small SSTables. The goal was to compact those files and release the disk 
space reserved by duplicate data. The setup was the following:

- Three nodes
- DateTieredCompactionStrategy, max_sstable_age_days = 5
- Cassandra 2.1.2

The setup and data format have been documented in detail in 
https://issues.apache.org/jira/browse/CASSANDRA-9644.

The test was started by dumping a few days' worth of data into the database 
for 100 000 signals. The time graphs of SSTables from the different nodes 
indicate that DTCS was working as expected and the SSTables are nicely 
ordered time-wise.

See files:
node0_20150727_1250_time_graph.txt
node1_20150727_1250_time_graph.txt
node2_20150727_1250_time_graph.txt

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  139.66.43.170  188.87 GB  256     ?       dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
UN  139.66.43.169  198.37 GB  256     ?       12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
UN  139.66.43.168  191.88 GB  256     ?       26088392-f803-4d59-9073-c75f857fb332  rack1

All nodes crashed due to a power failure (known beforehand) and repair 
operations were started for each node, one at a time. Below is the behavior 
of the SSTable count on the different nodes. New data was dumped 
simultaneously with the repair operations.

SEE FIGURE: sstable_count_figure1.png

The vertical lines indicate the following events:

1) Cluster was down due to power shutdown and was restarted. At the first 
vertical line the repair operation (nodetool repair -pr) was started for the 
first node
2) Repair operation for the second node was started after the first node was 
successfully repaired.
3) Repair operation for the third node was started
4) Third repair operation was finished
5) One of the nodes crashed (unknown reason in OS level)
6) Repair operation (nodetool repair -pr) was started for the first node
7) Repair operation for the second node was started
8) Repair operation for the third node was started
9) Repair operations finished

These repair operations led to a huge number of small SSTables covering 
the whole time span of the data. The compaction horizon of DTCS was limited 
to 5 days (max_sstable_age_days) due to the size of the SSTables on disk. 
Therefore, the small SSTables would not be compacted. Below are the time 
graphs from the SSTables after the second round of repairs.

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  xx.xx.xx.170   663.61 GB  256     ?       dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
UN  xx.xx.xx.169   763.52 GB  256     ?       12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
UN  xx.xx.xx.168   651.59 GB  256     ?       26088392-f803-4d59-9073-c75f857fb332  rack1

See files:
node0_20150810_1017_time_graph.txt
node1_20150810_1017_time_graph.txt
node2_20150810_1017_time_graph.txt

To get rid of the small SSTables, TimeWindowCompactionStrategy was taken into 
use. The window size was set to 5 days and Cassandra was updated to version 
2.1.8. The figure below shows the behavior of the SSTable count. TWCS was 
taken into use on 10.8.2015 at 13:10. The maximum number of files to be 
compacted in one task was limited to 32 to avoid running out of disk space.
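For reference, the compaction settings described above correspond roughly to a 
configuration like the following. This is a hedged sketch: the keyspace and 
table names are hypothetical, and the fully qualified class name depends on 
how the TWCS jar was installed on the cluster at the time.

```sql
-- Hypothetical keyspace/table; option names as accepted by TWCS.
ALTER TABLE sensors.samples WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',  -- 5-day windows, as in the test
  'compaction_window_size': '5',
  'max_threshold': '32'              -- cap of 32 files per compaction task
};
```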

See Figure: sstable_count_figure2.png

The shape of the trend clearly shows the effect of selecting SSTables into 
buckets based on size. Combining files slows down as the files inside a time 
window grow larger. When a time window has no more compactions to be done, 
the next time window is started; combining small files is again fast and the 
number of SSTables decreases quickly. Below are the time graphs for the 
SSTables after the compactions with TWCS had finished. New data was not 
dumped simultaneously with the compactions.

See files:
node0_20150812_1531_time_graph.txt
node1_20150812_1531_time_graph.txt
node2_20150812_1531_time_graph.txt

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  xx.xx.xx.170   436.17 GB  256     ?       dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
UN  xx.xx.xx.169   454.96 GB  256     ?       12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
UN  xx.xx.xx.168   439.13 GB  256     ?       26088392-f803-4d59-9073-c75f857fb332  rack1

Data dumping was activated again and the SSTable statistics were observed 
the next morning.

See files:
node0_20150813_0835_time_graph.txt
node1_20150813_0835_time_graph.txt
node2_20150813_0835_time_graph.txt

Since the data was dumped into the past, the newest data did not fall into 
the current time window, which is determined from the system time. Because 
new small SSTables (approximately 30-50 MB in size) kept appearing 
continuously, the compaction ended up combining one large SSTable with 
several small files over and over. The code was modified so that the current 
time is determined from the newest timestamp in the SSTables (as in DTCS). 
This modification led to much more reasonable compaction behavior when 
historical data is pushed to the database. Below are the time graphs from 
the nodes after one day. Size-tiered compaction was now able to work with 
the newest files as intended while dumping data in real time.

See files:
node0_20150814_1054_time_graph.txt
node1_20150814_1054_time_graph.txt
node2_20150814_1054_time_graph.txt

The change in behavior is clearly visible in the compaction hierarchy graph 
below. The TWCS modification is visible starting from line 39. See the 
description of the file format in 
https://issues.apache.org/jira/browse/CASSANDRA-9644.

See file: 20150814_1027_compaction_hierarchy.txt

The behavior of TWCS looks really promising and also works in practice!
We would like to propose some ideas for the future development of the 
algorithm.

1) The current time window would be determined from the newest timestamp 
found in the SSTables. This allows effective compaction of the SSTables when 
data is fed into history in chronological order. In the dumping process the 
timestamp of each column is set according to the timestamp of the data sample.
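Proposal 1 could be sketched as follows. This is an illustrative Python 
sketch (Cassandra itself is Java); the data structures and names are 
hypothetical, not Cassandra's actual API.

```python
# Proposal 1 sketch: anchor the "current" time window at the newest
# timestamp seen in any SSTable, rather than at the wall clock.

def newest_timestamp(sstables):
    """Newest (max) cell timestamp across all SSTables (epoch seconds)."""
    return max(s["max_timestamp"] for s in sstables)

def window_start(timestamp, window_size):
    """Align a timestamp down to the start of its time window."""
    return timestamp - (timestamp % window_size)

def current_window(sstables, window_size):
    """Current window follows the newest data, so historical back-fills
    still land in a full, compactable window instead of lagging behind
    a system-time-based window that contains no data."""
    return window_start(newest_timestamp(sstables), window_size)
```

With a window anchored this way, dumping purely historical data keeps the 
size-tiered logic working on the files that are actually being written.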

2) The number of SSTables participating in one compaction could be limited 
either by the number of files given by max_threshold OR by the total size of 
the files selected for the compaction bucket. The file size limitation would 
prevent combining large files together, which could otherwise cause an 
out-of-disk-space situation or extremely long-running compaction tasks.
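The dual limit in proposal 2 could look like the following illustrative 
Python sketch; the function and field names are hypothetical.

```python
# Proposal 2 sketch: cap a compaction bucket by file count (max_threshold)
# OR by cumulative on-disk size, whichever limit is hit first.

def trim_bucket(sstables, max_threshold, max_bucket_bytes):
    """Pick smallest files first, so many small SSTables get merged before
    any large one is touched; stop at the count cap or the size cap."""
    picked, total = [], 0
    for s in sorted(sstables, key=lambda s: s["size"]):
        if len(picked) >= max_threshold or total + s["size"] > max_bucket_bytes:
            break
        picked.append(s)
        total += s["size"]
    return picked
```

Choosing the smallest files first also mirrors the observed behavior above: 
merging small files is cheap, while pulling a large SSTable into every 
bucket makes each task slow and disk-hungry.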

3) Currently the time windows are handled one by one, starting from the 
newest. This does not lead to the fastest decrease in SSTable count. An 
alternative might be a round-robin approach in which the time windows are 
stepped through, only one compaction task is performed for a given time 
window, and the process then moves on to the next window.
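The round-robin scheduling of proposal 3 could be sketched as follows, again 
as illustrative Python with hypothetical names: `windows` maps a window 
start to its list of pending compaction buckets.

```python
# Proposal 3 sketch: instead of draining the newest window completely,
# step through the windows round-robin, doing one compaction task per
# window per pass, until no window has work left.

def round_robin_tasks(windows):
    """Yield (window, bucket) pairs, one bucket per window per pass,
    visiting the newest window first within each pass."""
    pending = {w: list(buckets) for w, buckets in windows.items()}
    while any(pending.values()):
        for w in sorted(pending, reverse=True):  # newest window first
            if pending[w]:
                yield w, pending[w].pop(0)
```

Spreading the tasks over all windows this way should reduce the total 
SSTable count faster than finishing one window at a time, since the cheap 
small-file merges in every window get done early.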

Side note: while observing the compaction process, compaction intermittently 
used two threads. However, sometimes during a long-running compaction task 
(lasting hours) a second thread did not kick in to work on the small 
SSTables, even though thousands of them were available for compaction.
 
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
