[ https://issues.apache.org/jira/browse/CASSANDRA-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713663#comment-14713663 ]

Jeff Jirsa commented on CASSANDRA-10195:
----------------------------------------

[~jbellis] - I have nothing formal, merely experience running DTCS on 
decent-sized clusters. Tickets like CASSANDRA-9661, CASSANDRA-8371, and 
https://issues.apache.org/jira/browse/CASSANDRA-8340?focusedCommentId=14567183&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14567183
 match my own experience and anecdotal evidence from very experienced 
operators: old timestamps pollute the current window and cause new sstables to 
compact against older, larger windows, "confusing" DTCS and causing ridiculous 
rolling compactions. On top of that, the selection criteria described in 
CASSANDRA-9644 and CASSANDRA-9597 cause significant problems over time: DTCS 
sorts files within a window by oldest timestamp first, which generally chooses 
largest -> smallest, so the large file gets recompacted over and over again 
whenever more than max_threshold files exist.
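
To make that selection difference concrete, here is a rough, hypothetical 
sketch (illustrative Java only, not the actual DTCS/TWCS code; the SSTableStub 
type, window size, and threshold are made up): group sstables into fixed time 
windows by newest timestamp, then compare oldest-first selection within a 
window against smallest-first (size-tiered style) selection.

{code:java}
// Illustrative sketch only; not the real compaction strategy code.
import java.util.*;
import java.util.stream.*;

public class SelectionSketch {
    // Hypothetical stand-in for an sstable's metadata.
    static class SSTableStub {
        final long maxTimestamp;   // newest cell timestamp in the file (millis)
        final long sizeBytes;
        SSTableStub(long maxTimestamp, long sizeBytes) {
            this.maxTimestamp = maxTimestamp;
            this.sizeBytes = sizeBytes;
        }
    }

    static final long WINDOW_MS = 24L * 60 * 60 * 1000;  // example: 1-day windows
    static final int MAX_THRESHOLD = 32;

    // Group sstables into fixed time windows keyed by window start.
    static Map<Long, List<SSTableStub>> byWindow(List<SSTableStub> sstables) {
        return sstables.stream().collect(
                Collectors.groupingBy((SSTableStub s) -> (s.maxTimestamp / WINDOW_MS) * WINDOW_MS));
    }

    // The problematic pattern: ordering a window oldest-first keeps re-selecting
    // the one big old file whenever more than MAX_THRESHOLD files exist.
    static List<SSTableStub> oldestFirst(List<SSTableStub> window) {
        return window.stream()
                .sorted(Comparator.comparingLong((SSTableStub s) -> s.maxTimestamp))
                .limit(MAX_THRESHOLD)
                .collect(Collectors.toList());
    }

    // TWCS-style alternative: within the window, prefer similar (small) sizes so
    // the large file is not rewritten on every pass.
    static List<SSTableStub> smallestFirst(List<SSTableStub> window) {
        return window.stream()
                .sorted(Comparator.comparingLong((SSTableStub s) -> s.sizeBytes))
                .limit(MAX_THRESHOLD)
                .collect(Collectors.toList());
    }
}
{code}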

In terms of read amplification, TWCS' design DOES rely on many assumptions that 
probably vary with the data model. One assumption is that, for systems 
sufficiently busy to need TTLs and DTCS, partitions will probably be limited to 
a short time period. That assumption is mirrored in Patrick McFadin's various 
talks on data modeling, where he adds a time/date component to the partition 
key to create time-based buckets. Failing to do this with TWCS will likely 
cause partitions to span multiple TWCS windows, which could cause read 
amplification.
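
As a concrete illustration of that data-modeling assumption (a hypothetical 
schema, not from this ticket), the partition key gets a day-sized bucket 
derived from the event timestamp, so a partition never spans more than one day 
and, with day-sized windows, never straddles TWCS windows:

{code:java}
// Hypothetical time-bucketed data model:
//   CREATE TABLE sensor_data (
//       sensor_id text, day_bucket int, ts timestamp, value double,
//       PRIMARY KEY ((sensor_id, day_bucket), ts));
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DayBucket {
    // Derive the partition-key bucket from the event timestamp so one partition
    // covers at most one day (and therefore at most one day-sized TWCS window).
    static int dayBucket(Instant eventTime) {
        LocalDate day = eventTime.atZone(ZoneOffset.UTC).toLocalDate();
        return (int) day.toEpochDay();
    }

    public static void main(String[] args) {
        // e.g. 2015-08-25 -> 16672
        System.out.println(dayBucket(Instant.parse("2015-08-25T12:00:00Z")));
    }
}
{code}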

If hard numbers will help TWCS make it into the project, I will happily compile 
those for you in the next few weeks. 

> TWCS experiments and improvement proposals
> ------------------------------------------
>
>                 Key: CASSANDRA-10195
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10195
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Antti Nissinen
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 20150814_1027_compaction_hierarchy.txt, 
> node0_20150727_1250_time_graph.txt, node0_20150810_1017_time_graph.txt, 
> node0_20150812_1531_time_graph.txt, node0_20150813_0835_time_graph.txt, 
> node0_20150814_1054_time_graph.txt, node1_20150727_1250_time_graph.txt, 
> node1_20150810_1017_time_graph.txt, node1_20150812_1531_time_graph.txt, 
> node1_20150813_0835_time_graph.txt, node1_20150814_1054_time_graph.txt, 
> node2_20150727_1250_time_graph.txt, node2_20150810_1017_time_graph.txt, 
> node2_20150812_1531_time_graph.txt, node2_20150813_0835_time_graph.txt, 
> node2_20150814_1054_time_graph.txt, sstable_count_figure1.png, 
> sstable_count_figure2.png
>
>
> This JIRA item describes experiments with DateTieredCompactionStrategy (DTCS) 
> and TimeWindowCompactionStrategy (TWCS) and proposes modifications to TWCS. 
> In a test system several crashes were caused intentionally (and 
> unintentionally) and repair operations were executed, leading to a flood of 
> small SSTables. The goal was to be able to compact those files and release 
> the disk space reserved by duplicate data. The setup is as follows:
> - Three nodes
> - DateTieredCompactionStrategy, max_sstable_age_days = 5
> - Cassandra 2.1.2
> The setup and data format have been documented in detail here: 
> https://issues.apache.org/jira/browse/CASSANDRA-9644. A rough configuration 
> sketch is given below.
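>
> As a rough illustration (hypothetical code and table name, not part of this 
> ticket), the setting above could be applied through the DataStax Java driver 
> like this:
> {code:java}
> // Hypothetical sketch; the keyspace/table "metrics.samples" is made up.
> import com.datastax.driver.core.Cluster;
> import com.datastax.driver.core.Session;
>
> public class SetDtcs {
>     public static void main(String[] args) {
>         try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
>             Session session = cluster.connect();
>             // DTCS with max_sstable_age_days = 5, as in the test setup.
>             session.execute(
>                 "ALTER TABLE metrics.samples WITH compaction = {"
>                 + " 'class': 'DateTieredCompactionStrategy',"
>                 + " 'max_sstable_age_days': '5' }");
>         }
>     }
> }
> {code}
>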
> The test was started by dumping a few days' worth of data to the database for 
> 100 000 signals. The time graphs of SSTables from the different nodes indicate 
> that DTCS has been working as expected and the SSTables are nicely ordered 
> time-wise.
> See files:
> node0_20150727_1250_time_graph.txt
> node1_20150727_1250_time_graph.txt
> node2_20150727_1250_time_graph.txt
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns  Host ID                               Rack
> UN  139.66.43.170  188.87 GB  256     ?     dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  139.66.43.169  198.37 GB  256     ?     12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  139.66.43.168  191.88 GB  256     ?     26088392-f803-4d59-9073-c75f857fb332  rack1
> All nodes crashed due to a power failure (known about beforehand) and repair 
> operations were started for each node, one at a time. Below is the behavior 
> of the SSTable count on the different nodes. New data was dumped 
> simultaneously with the repair operations.
> See figure: sstable_count_figure1.png
> The vertical lines indicate the following events:
> 1) The cluster was down due to the power shutdown and was restarted. At the 
> first vertical line the repair operation (nodetool repair -pr) was started 
> for the first node.
> 2) The repair operation for the second node was started after the first node 
> was successfully repaired.
> 3) The repair operation for the third node was started.
> 4) The third repair operation finished.
> 5) One of the nodes crashed (unknown reason at the OS level).
> 6) The repair operation (nodetool repair -pr) was started for the first node.
> 7) The repair operation for the second node was started.
> 8) The repair operation for the third node was started.
> 9) The repair operations finished.
> These repair operations led to a huge number of small SSTables covering the 
> whole time span of the data. The compaction horizon of DTCS was limited to 
> 5 days (max_sstable_age_days) because of the size of the SSTables on disk, so 
> the small SSTables older than that were never compacted. Below are the time 
> graphs of the SSTables after the second round of repairs.
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address       Load       Tokens  Owns  Host ID                               Rack
> UN  xx.xx.xx.170  663.61 GB  256     ?     dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  xx.xx.xx.169  763.52 GB  256     ?     12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  xx.xx.xx.168  651.59 GB  256     ?     26088392-f803-4d59-9073-c75f857fb332  rack1
> See files:
> node0_20150810_1017_time_graph.txt
> node1_20150810_1017_time_graph.txt
> node2_20150810_1017_time_graph.txt
> To get rid of the small SSTables, TimeWindowCompactionStrategy was taken into 
> use with a window size of 5 days, and the Cassandra version was updated to 
> 2.1.8. The figure below shows the behavior of the SSTable count. TWCS was 
> taken into use on 10.8.2015 at 13:10. The maximum number of files to be 
> compacted in one task was limited to 32 to avoid running out of disk space.
> See figure: sstable_count_figure2.png
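>
> For reference, a hypothetical form of the statement used for the switch is 
> shown below; the option names assume the TWCS version that was later merged 
> upstream (CASSANDRA-9666), and the class and table names may differ for the 
> external 2.1 build:
> {code:java}
> // Hypothetical sketch; option/class/table names are assumptions, not from this ticket.
> public class TwcsAlterStatement {
>     static final String ALTER_TO_TWCS =
>         "ALTER TABLE metrics.samples WITH compaction = {"
>         + " 'class': 'TimeWindowCompactionStrategy',"
>         + " 'compaction_window_unit': 'DAYS',"
>         + " 'compaction_window_size': '5',"
>         + " 'max_threshold': '32' }";   // cap of 32 files per compaction task
>
>     public static void main(String[] args) {
>         System.out.println(ALTER_TO_TWCS);
>     }
> }
> {code}
>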
> The shape of the trend clearly shows the effect of selecting SSTables for 
> buckets based on size. Combining files gets slower as the files inside the 
> time window grow bigger. When a time window has no more compactions to do, the 
> next time window is started; combining small files is again fast and the 
> number of SSTables decreases quickly. Below are the time graphs of the 
> SSTables once the TWCS compactions had finished. New data was not dumped 
> simultaneously with the compactions.
> See files:
> node0_20150812_1531_time_graph.txt
> node1_20150812_1531_time_graph.txt
> node2_20150812_1531_time_graph.txt
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address       Load       Tokens  Owns  Host ID                               Rack
> UN  xx.xx.xx.170  436.17 GB  256     ?     dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  xx.xx.xx.169  454.96 GB  256     ?     12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  xx.xx.xx.168  439.13 GB  256     ?     26088392-f803-4d59-9073-c75f857fb332  rack1
> Data dumping was activated again and the SSTable statistics were observed 
> again the next morning.
> See files:
> node0_20150813_0835_time_graph.txt
> node1_20150813_0835_time_graph.txt
> node2_20150813_0835_time_graph.txt
> Since the data was dumped into the history, the newest data did not fall into 
> the current time window, which is determined from the system time. Because new 
> small SSTables (approximately 30-50 MB in size) were appearing continuously, 
> the compaction ended up repeatedly compacting one large SSTable together with 
> several small files. The code was modified so that the current time is 
> determined from the newest timestamp found in the SSTables (as in DTCS). This 
> modification led to much more reasonable compaction behavior for the case 
> where historical data is pushed to the database. Below are the time graphs 
> from the nodes after one day. Size-tiered compaction was now able to work 
> with the newest files as intended while data was dumped in real time.
> See files:
> node0_20150814_1054_time_graph.txt
> node1_20150814_1054_time_graph.txt
> node2_20150814_1054_time_graph.txt
> The change in behavior is clearly visible in the compaction hierarchy graph 
> below. The TWCS modification is visible starting from line 39. See the 
> description of the file format in 
> https://issues.apache.org/jira/browse/CASSANDRA-9644.
> See file: 20150814_1027_compaction_hierarchy.txt
> The behavior of TWCS looks really promising and it also works in practice!
> We would like to propose some ideas for the future development of the 
> algorithm; a rough sketch of ideas 1-3 is given after the list.
> 1) The current time window would be determined from the newest timestamp 
> found in the SSTables. This allows effective compaction of the SSTables when 
> data is fed into the history in time order. In the dumping process the 
> timestamp of the column is set according to the timestamp of the data sample.
> 2) The number of SSTables participating in one compaction could be limited 
> either by the file count given by max_threshold OR by the total size of the 
> files selected for the compaction bucket. A size limit would prevent combining 
> large files together, which can cause an out-of-disk-space situation or 
> extremely long-running compaction tasks.
> 3) Currently, time windows are handled one by one starting from the newest. 
> This does not lead to the fastest decrease in SSTable count. An alternative 
> might be a round-robin approach in which the time windows are stepped through, 
> only one compaction task is done for a given time window, and the process 
> then moves on to the next time window.
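>
> Below is a rough, hypothetical sketch of ideas 1-3 (plain illustrative Java, 
> not a patch against the actual strategy code; the SSTableStub type and the 
> constants are made up):
> {code:java}
> // Hypothetical sketch of proposals 1-3; not the real compaction code.
> import java.util.*;
>
> public class ProposalSketch {
>     static class SSTableStub {
>         final long maxTimestamp;   // newest cell timestamp in the file (millis)
>         final long sizeBytes;
>         SSTableStub(long maxTimestamp, long sizeBytes) {
>             this.maxTimestamp = maxTimestamp;
>             this.sizeBytes = sizeBytes;
>         }
>     }
>
>     static final long WINDOW_MS = 5L * 24 * 60 * 60 * 1000;          // 5-day windows
>     static final int  MAX_FILES = 32;                                // idea 2: count cap
>     static final long MAX_BUCKET_BYTES = 100L * 1024 * 1024 * 1024;  // idea 2: size cap (example)
>
>     // Idea 1: "now" is the newest timestamp seen in any sstable, not the wall
>     // clock, so the current window follows the data being dumped.
>     static long now(List<SSTableStub> sstables) {
>         long newest = 0L;
>         for (SSTableStub s : sstables) newest = Math.max(newest, s.maxTimestamp);
>         return newest;
>     }
>
>     // Idea 2: trim a candidate bucket by file count AND by total size.
>     static List<SSTableStub> trimBucket(List<SSTableStub> bucket) {
>         List<SSTableStub> out = new ArrayList<>();
>         long total = 0L;
>         for (SSTableStub s : bucket) {
>             if (out.size() >= MAX_FILES || total + s.sizeBytes > MAX_BUCKET_BYTES) break;
>             out.add(s);
>             total += s.sizeBytes;
>         }
>         return out;
>     }
>
>     // Idea 3: round-robin over the windows (keyed by window start), submitting
>     // at most one task per window per pass instead of draining one window first.
>     static List<List<SSTableStub>> onePassOfTasks(SortedMap<Long, List<SSTableStub>> windows) {
>         List<List<SSTableStub>> tasks = new ArrayList<>();
>         for (List<SSTableStub> window : windows.values()) {
>             List<SSTableStub> bucket = trimBucket(window);
>             if (bucket.size() > 1) tasks.add(bucket);   // one task per window
>         }
>         return tasks;
>     }
> }
> {code}
>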
> Side note: while observing the compaction process it appeared that compaction 
> was intermittently using two threads. However, sometimes during a long-lasting 
> compaction task (hours) the other thread did not kick in to work on the small 
> SSTables, even though there were thousands of them available for compaction.


