[ https://issues.apache.org/jira/browse/CASSANDRA-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739091#comment-14739091 ]

Philip Thompson edited comment on CASSANDRA-10195 at 9/10/15 5:31 PM:
----------------------------------------------------------------------

I will start those tests now, but it will take a few days for them to run. Do 
you need me to set any special compaction options?


was (Author: philipthompson):
I will start those tests now, but it will take a few days for them to run.

> TWCS experiments and improvement proposals
> ------------------------------------------
>
>                 Key: CASSANDRA-10195
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10195
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Antti Nissinen
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 20150814_1027_compaction_hierarchy.txt, 
> node0_20150727_1250_time_graph.txt, node0_20150810_1017_time_graph.txt, 
> node0_20150812_1531_time_graph.txt, node0_20150813_0835_time_graph.txt, 
> node0_20150814_1054_time_graph.txt, node1_20150727_1250_time_graph.txt, 
> node1_20150810_1017_time_graph.txt, node1_20150812_1531_time_graph.txt, 
> node1_20150813_0835_time_graph.txt, node1_20150814_1054_time_graph.txt, 
> node2_20150727_1250_time_graph.txt, node2_20150810_1017_time_graph.txt, 
> node2_20150812_1531_time_graph.txt, node2_20150813_0835_time_graph.txt, 
> node2_20150814_1054_time_graph.txt, sstable_count_figure1.png, 
> sstable_count_figure2.png
>
>
> This JIRA item describes experiments with DateTieredCompactionStrategy (DTCS) 
> and TimeWindowCompactionStrategy (TWCS) and proposes modifications to TWCS. In 
> a test system several crashes were caused intentionally (and unintentionally) 
> and repair operations were executed, leading to a flood of small SSTables. The 
> goal was to compact those files and release the disk space reserved by 
> duplicate data. The setup was as follows:
> - Three nodes
> - DateTieredCompactionStrategy, max_sstable_age_days = 5
> - Cassandra 2.1.2
> The setup and data format have been documented in detail in 
> https://issues.apache.org/jira/browse/CASSANDRA-9644.
> The test was started by dumping a few days' worth of data to the database for 
> 100 000 signals. Time graphs of the SSTables from the different nodes indicate 
> that DTCS worked as expected and the SSTables were nicely ordered time-wise.
> See files:
> node0_20150727_1250_time_graph.txt
> node1_20150727_1250_time_graph.txt
> node2_20150727_1250_time_graph.txt
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns    Host ID                               Rack
> UN  139.66.43.170  188.87 GB  256     ?       dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  139.66.43.169  198.37 GB  256     ?       12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  139.66.43.168  191.88 GB  256     ?       26088392-f803-4d59-9073-c75f857fb332  rack1
> All nodes crashed due to a power failure (known beforehand) and repair 
> operations were started for each node, one at a time. Below is the behavior of 
> the SSTable count on the different nodes. New data was dumped simultaneously 
> with the repair operations.
> SEE FIGURE: sstable_count_figure1.png
> The vertical lines indicate the following events:
> 1) The cluster went down due to a power shutdown and was restarted. At the 
> first vertical line the repair operation (nodetool repair -pr) was started for 
> the first node.
> 2) The repair operation for the second node was started after the first node 
> had been successfully repaired.
> 3) The repair operation for the third node was started.
> 4) The third repair operation finished.
> 5) One of the nodes crashed (for an unknown reason at the OS level).
> 6) The repair operation (nodetool repair -pr) was started for the first node.
> 7) The repair operation for the second node was started.
> 8) The repair operation for the third node was started.
> 9) The repair operations finished.
> These repair operations led to a huge number of small SSTables covering the 
> whole time span of the data. The compaction horizon of DTCS had been limited to 
> 5 days (max_sstable_age_days) because of the size of the SSTables on disk, so 
> the small SSTables were never compacted. Below are the time graphs of the 
> SSTables after the second round of repairs; a minimal sketch of the age cut-off 
> follows the file list.
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns    Host ID                               Rack
> UN  xx.xx.xx.170   663.61 GB  256     ?       dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  xx.xx.xx.169   763.52 GB  256     ?       12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  xx.xx.xx.168   651.59 GB  256     ?       26088392-f803-4d59-9073-c75f857fb332  rack1
> See files:
> node0_20150810_1017_time_graph.txt
> node1_20150810_1017_time_graph.txt
> node2_20150810_1017_time_graph.txt
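> As a minimal sketch (with illustrative names, not the actual 
> DateTieredCompactionStrategy source), the age cut-off that keeps these 
> repair-generated files uncompacted can be thought of like this:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.TimeUnit;
>
> // Illustrative stand-in for an SSTable; only the newest cell timestamp matters here.
> class SSTable
> {
>     final long maxTimestampMicros;
>     SSTable(long maxTimestampMicros) { this.maxTimestampMicros = maxTimestampMicros; }
> }
>
> class AgeCutoffSketch
> {
>     // Hypothetical restatement of the max_sstable_age_days rule: an SSTable whose
>     // newest cell is older than the cut-off never takes part in compaction again,
>     // which is why small repair-generated files covering old time spans pile up.
>     static List<SSTable> filterOldSSTables(List<SSTable> candidates, long nowMicros, int maxSSTableAgeDays)
>     {
>         long cutoff = nowMicros - TimeUnit.DAYS.toMicros(maxSSTableAgeDays);
>         List<SSTable> stillCompactable = new ArrayList<>();
>         for (SSTable s : candidates)
>             if (s.maxTimestampMicros >= cutoff)
>                 stillCompactable.add(s);
>         return stillCompactable;
>     }
> }
> {code}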
> To get rid of the SSTables, TimeWindowCompactionStrategy was taken into use 
> with the window size set to 5 days, and the Cassandra version was updated to 
> 2.1.8. The figure below shows the behavior of the SSTable count. TWCS was 
> enabled on 10.8.2015 at 13:10. The maximum number of files compacted in one 
> task was limited to 32 to avoid running out of disk space.
> See Figure: sstable_count_figure2.png
> The shape of the trend clearly shows the effect of selecting SSTables for 
> buckets based on size: combining files slows down as the files inside the time 
> window grow. When a time window has no more compactions to do, the next time 
> window is started, where combining small files is again fast and the number of 
> SSTables decreases quickly. Below are the time graphs for the SSTables after 
> the TWCS compactions had finished. New data was not dumped simultaneously with 
> the compactions.
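> Roughly, this per-window behavior can be sketched as follows (illustrative 
> names and simplified bucketing, not the actual TWCS source):
> {code:java}
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.List;
> import java.util.SortedMap;
> import java.util.TreeMap;
> import java.util.concurrent.TimeUnit;
>
> class WindowSketch
> {
>     static final long WINDOW_MICROS = TimeUnit.DAYS.toMicros(5); // 5-day windows, as in the test
>
>     static class SSTable
>     {
>         final long maxTimestampMicros;
>         SSTable(long maxTimestampMicros) { this.maxTimestampMicros = maxTimestampMicros; }
>     }
>
>     // Group files by the window containing their newest timestamp, newest window first.
>     static SortedMap<Long, List<SSTable>> bucketByWindow(List<SSTable> sstables)
>     {
>         SortedMap<Long, List<SSTable>> windows = new TreeMap<>(Comparator.reverseOrder());
>         for (SSTable s : sstables)
>         {
>             long windowStart = (s.maxTimestampMicros / WINDOW_MICROS) * WINDOW_MICROS;
>             windows.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(s);
>         }
>         return windows;
>     }
>
>     // Pick the next task: the newest window that still has at least two files,
>     // capped at maxThreshold files (32 in the experiment) to bound disk usage.
>     static List<SSTable> nextTask(SortedMap<Long, List<SSTable>> windows, int maxThreshold)
>     {
>         for (List<SSTable> files : windows.values())
>             if (files.size() >= 2)
>                 return new ArrayList<>(files.subList(0, Math.min(files.size(), maxThreshold)));
>         return Collections.emptyList();
>     }
> }
> {code}
> Once nextTask() keeps returning files from the same window, each merge involves 
> ever larger inputs, which is exactly the slowdown visible in the figure.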
> See files:
> node0_20150812_1531_time_graph.txt
> node1_20150812_1531_time_graph.txt
> node2_20150812_1531_time_graph.txt
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns    Host ID                               Rack
> UN  xx.xx.xx.170   436.17 GB  256     ?       dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  xx.xx.xx.169   454.96 GB  256     ?       12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  xx.xx.xx.168   439.13 GB  256     ?       26088392-f803-4d59-9073-c75f857fb332  rack1
> Data dumping was activated again and the SSTable statistics were observed 
> again the next morning.
> See files:
> node0_20150813_0835_time_graph.txt
> node1_20150813_0835_time_graph.txt
> node2_20150813_0835_time_graph.txt
> Since the data was dumped into the history, the newest data did not fall into 
> the current time window, which is determined from the system time. Because new 
> small SSTables (approximately 30-50 MB in size) kept appearing continuously, 
> the compactions ended up combining one large SSTable with several small files. 
> The code was modified so that the current time is determined from the newest 
> timestamp in the SSTables (as in DTCS). This modification led to much more 
> reasonable compaction behavior when historical data is pushed to the database. 
> Below are the time graphs from the nodes after one day; size-tiered compaction 
> was now able to work on the newest files as intended while data was dumped in 
> real time. A sketch of the modification follows the file list.
> See files:
> node0_20150814_1054_time_graph.txt
> node1_20150814_1054_time_graph.txt
> node2_20150814_1054_time_graph.txt
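> The modification can be sketched like this (illustrative names; the actual 
> patch may differ): instead of the wall clock, the current window is anchored at 
> the newest timestamp across the SSTables, the way DTCS derives "now":
> {code:java}
> import java.util.List;
>
> class NowFromData
> {
>     static class SSTable
>     {
>         final long maxTimestampMicros;
>         SSTable(long maxTimestampMicros) { this.maxTimestampMicros = maxTimestampMicros; }
>     }
>
>     // With backfilled data the newest window then tracks the data rather than the
>     // system time, so fresh small files keep being size-tiered together instead
>     // of being repeatedly merged into one already-large SSTable.
>     static long now(List<SSTable> sstables, long wallClockMicros)
>     {
>         long newest = Long.MIN_VALUE;
>         for (SSTable s : sstables)
>             newest = Math.max(newest, s.maxTimestampMicros);
>         return sstables.isEmpty() ? wallClockMicros : newest; // fall back to wall clock when empty
>     }
> }
> {code}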
> The change in behavior is clearly visible in the compaction hierarchy graph 
> below; the TWCS modification is visible starting from line 39. See the 
> description of the file format in 
> https://issues.apache.org/jira/browse/CASSANDRA-9644.
> See file: 20150814_1027_compaction_hierarchy.txt
> The behavior of TWCS looks really promising and it also works in practice!
> We would like to propose some ideas for future development of the algorithm.
> 1) The current time window would be determined from the newest timestamp found 
> in the SSTables. This allows effective compaction of the SSTables when data is 
> fed into the history in time order. In the dumping process the timestamp of 
> each column is set according to the timestamp of the data sample.
> 2) The number of SSTables participating in one compaction could be limited 
> either by the number of files given by max_threshold OR by the total size of 
> the files selected for the compaction bucket. The file size limitation would 
> prevent combining large files together, which could otherwise cause an 
> out-of-disk-space situation or extremely long-running compaction tasks (see the 
> sketch after this list).
> 3) Currently the time windows are handled one by one, starting from the 
> newest. This does not lead to the fastest decrease in the SSTable count. An 
> alternative might be a round-robin approach in which the time windows are 
> stepped through, one compaction task is done for the given time window, and the 
> process then moves on to the next window (also sketched below).
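> To make proposals 2) and 3) concrete, here is a rough sketch under the same 
> illustrative naming (not committed code): a bucket is capped by file count OR 
> total bytes, and one pass issues a single task per window instead of draining 
> the newest window first:
> {code:java}
> import java.util.ArrayList;
> import java.util.Comparator;
> import java.util.List;
> import java.util.Map;
>
> class ProposalSketch
> {
>     static class SSTable
>     {
>         final long sizeBytes;
>         SSTable(long sizeBytes) { this.sizeBytes = sizeBytes; }
>     }
>
>     // Proposal 2: take files smallest-first until either max_threshold files or
>     // maxTotalBytes is reached, so a single task can neither exhaust the disk
>     // headroom nor spend hours merging a few huge files.
>     static List<SSTable> capBucket(List<SSTable> bucket, int maxThreshold, long maxTotalBytes)
>     {
>         List<SSTable> sorted = new ArrayList<>(bucket);
>         sorted.sort(Comparator.comparingLong(s -> s.sizeBytes));
>         List<SSTable> picked = new ArrayList<>();
>         long total = 0;
>         for (SSTable s : sorted)
>         {
>             if (picked.size() == maxThreshold || total + s.sizeBytes > maxTotalBytes)
>                 break;
>             picked.add(s);
>             total += s.sizeBytes;
>         }
>         return picked;
>     }
>
>     // Proposal 3: one pass of the round-robin schedule, issuing at most one task
>     // per time window. The cheap many-small-file compactions in every window get
>     // done early, so the global SSTable count drops faster.
>     static List<List<SSTable>> roundRobinPass(Map<Long, List<SSTable>> windows, int maxThreshold, long maxTotalBytes)
>     {
>         List<List<SSTable>> tasks = new ArrayList<>();
>         for (List<SSTable> bucket : windows.values())
>         {
>             List<SSTable> task = capBucket(bucket, maxThreshold, maxTotalBytes);
>             if (task.size() >= 2)
>                 tasks.add(task);
>         }
>         return tasks;
>     }
> }
> {code}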
> Side note: while observing the compaction process, it appeared that compaction 
> was intermittently using two threads. However, sometimes during a long-lasting 
> compaction task (hours) a second thread did not kick in to work on the small 
> SSTables, even though there were thousands of them available for compaction.


