[
https://issues.apache.org/jira/browse/CASSANDRA-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739091#comment-14739091
]
Philip Thompson commented on CASSANDRA-10195:
---------------------------------------------
I will start those tests now, but it will take a few days for them to run.
> TWCS experiments and improvement proposals
> ------------------------------------------
>
> Key: CASSANDRA-10195
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10195
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Antti Nissinen
> Fix For: 2.1.x, 2.2.x
>
> Attachments: 20150814_1027_compaction_hierarchy.txt,
> node0_20150727_1250_time_graph.txt, node0_20150810_1017_time_graph.txt,
> node0_20150812_1531_time_graph.txt, node0_20150813_0835_time_graph.txt,
> node0_20150814_1054_time_graph.txt, node1_20150727_1250_time_graph.txt,
> node1_20150810_1017_time_graph.txt, node1_20150812_1531_time_graph.txt,
> node1_20150813_0835_time_graph.txt, node1_20150814_1054_time_graph.txt,
> node2_20150727_1250_time_graph.txt, node2_20150810_1017_time_graph.txt,
> node2_20150812_1531_time_graph.txt, node2_20150813_0835_time_graph.txt,
> node2_20150814_1054_time_graph.txt, sstable_count_figure1.png,
> sstable_count_figure2.png
>
>
> This JIRA item describes experiments with DateTieredCompactionStrategy (DTCS)
> and TimeWindowCompactionStrategy (TWCS) and proposes modifications to TWCS.
> In a test system several crashes were caused intentionally (and
> unintentionally) and repair operations were executed, leading to a flood of
> small SSTables. The target was to be able to compact those files and release
> the disk space reserved by duplicate data. The setup was the following:
> - Three nodes
> - DateTieredCompactionStrategy, max_sstable_age_days = 5
> - Cassandra 2.1.2
> The setup and data format have been documented in detail in
> https://issues.apache.org/jira/browse/CASSANDRA-9644.
> The test was started by dumping a few days' worth of data to the database for
> 100 000 signals. Time graphs of the SSTables from the different nodes indicate
> that DTCS has been working as expected and the SSTables are nicely ordered
> time-wise.
> See files:
> node0_20150727_1250_time_graph.txt
> node1_20150727_1250_time_graph.txt
> node2_20150727_1250_time_graph.txt
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns  Host ID                               Rack
> UN  139.66.43.170  188.87 GB  256     ?     dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  139.66.43.169  198.37 GB  256     ?     12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  139.66.43.168  191.88 GB  256     ?     26088392-f803-4d59-9073-c75f857fb332  rack1
> All nodes crashed due to a power failure (known about beforehand) and repair
> operations were started for each node one at a time. Below is the behavior
> of the SSTable count on the different nodes. New data was dumped simultaneously
> with the repair operations.
> SEE FIGURE: sstable_count_figure1.png
> The vertical lines indicate the following events:
> 1) Cluster was down due to power shutdown and was restarted. At the first
> vertical line the repair operation (nodetool repair -pr) was started for the
> first node
> 2) The repair operation for the second node was started after the first node
> was successfully repaired.
> 3) The repair operation for the third node was started
> 4) Third repair operation was finished
> 5) One of the nodes crashed (unknown reason in OS level)
> 6) Repair operation (nodetool repair -pr) was started for the first node
> 7) Repair operation for the second node was started
> 8) Repair operation for the third node was started
> 9) Repair operations finished
> These repair operations led to a huge number of small SSTables covering
> the whole time span of the data. The compaction horizon of DTCS was limited
> to 5 days (max_sstable_age_days) due to the size of the SSTables on disk.
> Therefore, the small SSTables older than that are never compacted. Below are
> the time graphs of the SSTables after the second round of repairs.
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address       Load       Tokens  Owns  Host ID                               Rack
> UN  xx.xx.xx.170  663.61 GB  256     ?     dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  xx.xx.xx.169  763.52 GB  256     ?     12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  xx.xx.xx.168  651.59 GB  256     ?     26088392-f803-4d59-9073-c75f857fb332  rack1
> See files:
> node0_20150810_1017_time_graph.txt
> node1_20150810_1017_time_graph.txt
> node2_20150810_1017_time_graph.txt
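> As a side illustration of why the max_sstable_age_days limit leaves these small
> repair-generated SSTables untouched, here is a minimal sketch in plain Java (the
> SSTableInfo record and the filter are hypothetical stand-ins, not the actual DTCS
> code): any SSTable whose newest timestamp falls behind the age cutoff is simply
> dropped from candidate selection, no matter how many small files pile up behind it.
> {code:java}
> import java.util.List;
> import java.util.concurrent.TimeUnit;
> import java.util.stream.Collectors;
>
> // Hypothetical stand-in for SSTable metadata; not a Cassandra class.
> record SSTableInfo(String name, long maxTimestampMicros, long sizeBytes) {}
>
> public class MaxAgeFilterSketch {
>     // Drop SSTables whose newest data is older than maxSSTableAgeDays, mirroring
>     // in spirit how DTCS stops considering old SSTables for compaction.
>     static List<SSTableInfo> filterByAge(List<SSTableInfo> candidates,
>                                          long nowMicros, int maxSSTableAgeDays) {
>         long cutoff = nowMicros - TimeUnit.DAYS.toMicros(maxSSTableAgeDays);
>         return candidates.stream()
>                          .filter(s -> s.maxTimestampMicros() >= cutoff)
>                          .collect(Collectors.toList());
>     }
>
>     public static void main(String[] args) {
>         long now = System.currentTimeMillis() * 1000L; // microseconds, like cell timestamps
>         List<SSTableInfo> tables = List.of(
>             new SSTableInfo("old-repair-fragment", now - TimeUnit.DAYS.toMicros(20), 40L << 20),
>             new SSTableInfo("recent-flush",        now - TimeUnit.HOURS.toMicros(2), 40L << 20));
>         // With max_sstable_age_days = 5 only "recent-flush" survives the filter, so the
>         // thousands of small repair SSTables in older windows are never compacted.
>         System.out.println(filterByAge(tables, now, 5));
>     }
> }
> {code}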
> To get rid of the SSTables, the TimeWindowCompactionStrategy was taken into
> use. The window size was set to 5 days and the Cassandra version was updated to
> 2.1.8. The figure below shows the behavior of the SSTable count. TWCS was taken
> into use on 10.8.2015 at 13:10. The maximum number of files to be compacted in
> one task was limited to 32 to avoid running out of disk space.
> See Figure: sstable_count_figure2.png
> The shape of the trend clearly shows how the size-based selection of SSTables
> into buckets affects the compaction rate: combining files gets slower as the
> files inside the time window grow bigger. When a time window has no more
> compactions to be done, the next time window is started; combining small files
> is fast again and the number of SSTables decreases quickly. Below are the
> time graphs of the SSTables after the compactions with TWCS had finished. New
> data was not dumped simultaneously with the compactions.
> See files:
> node0_20150812_1531_time_graph.txt
> node1_20150812_1531_time_graph.txt
> node2_20150812_1531_time_graph.txt
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address       Load       Tokens  Owns  Host ID                               Rack
> UN  xx.xx.xx.170  436.17 GB  256     ?     dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  xx.xx.xx.169  454.96 GB  256     ?     12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  xx.xx.xx.168  439.13 GB  256     ?     26088392-f803-4d59-9073-c75f857fb332  rack1
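> The time-window behavior described above can be illustrated with a small sketch
> in plain Java (SSTableMeta is a hypothetical stand-in and this shows only the
> general bucketing idea, not the TWCS implementation): SSTables are grouped into
> fixed windows by flooring their newest timestamp to the window boundary, and
> compaction works size-tiered inside one window before the next window is started.
> {code:java}
> import java.util.List;
> import java.util.Map;
> import java.util.concurrent.TimeUnit;
> import java.util.stream.Collectors;
>
> // Hypothetical stand-in for SSTable metadata; not a Cassandra class.
> record SSTableMeta(String name, long maxTimestampMillis, long sizeBytes) {}
>
> public class WindowBucketingSketch {
>     // Group SSTables into fixed windows of windowSizeDays by flooring their newest
>     // timestamp to the window start -- the general idea behind TWCS bucketing.
>     static Map<Long, List<SSTableMeta>> bucketByWindow(List<SSTableMeta> sstables,
>                                                        int windowSizeDays) {
>         long windowMillis = TimeUnit.DAYS.toMillis(windowSizeDays);
>         return sstables.stream().collect(Collectors.groupingBy(
>                 s -> (s.maxTimestampMillis() / windowMillis) * windowMillis));
>     }
>
>     public static void main(String[] args) {
>         long now = System.currentTimeMillis();
>         List<SSTableMeta> tables = List.of(
>                 new SSTableMeta("fresh-flush",  now,                              40L << 20),
>                 new SSTableMeta("yesterday",    now - TimeUnit.DAYS.toMillis(1),  45L << 20),
>                 new SSTableMeta("old-big-file", now - TimeUnit.DAYS.toMillis(12), 10L << 30));
>         // "old-big-file" lands in an older 5-day window and is only touched once the
>         // newer window has no compactions left to do.
>         bucketByWindow(tables, 5).forEach((windowStart, group) ->
>                 System.out.println("window starting " + windowStart + " -> " + group));
>     }
> }
> {code}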
> Data dumping was activated again and the SSTable statistics were observed
> the next morning.
> See files:
> node0_20150813_0835_time_graph.txt
> node1_20150813_0835_time_graph.txt
> node2_20150813_0835_time_graph.txt
> Since the data was dumped into the history, the newest data did not fall into
> the current time window, which is determined from the system time. Because new
> small SSTables (approximately 30-50 MB in size) were appearing continuously,
> the compaction ended up repeatedly combining one large SSTable with several
> small files. The code was modified so that the current time is determined
> from the newest timestamp found in the SSTables (as in DTCS). This modification
> led to much more reasonable compaction behavior when historical data is pushed
> to the database. Below are the time graphs from the nodes after one day; a
> sketch of the idea follows the file list. Size-tiered compaction was now able
> to work on the newest files as intended while data was being dumped in real time.
> See files:
> node0_20150814_1054_time_graph.txt
> node1_20150814_1054_time_graph.txt
> node2_20150814_1054_time_graph.txt
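> A minimal sketch of the modification in plain Java (SSTableStat and the helper
> methods are hypothetical, not the patched TWCS code): "now" is taken from the
> newest timestamp seen in any SSTable instead of the wall clock, so the active
> window follows the data even when history is being back-filled.
> {code:java}
> import java.util.List;
> import java.util.concurrent.TimeUnit;
>
> // Hypothetical stand-in for SSTable metadata; not a Cassandra class.
> record SSTableStat(String name, long maxTimestampMillis) {}
>
> public class NewestTimestampNowSketch {
>     // Derive "now" from the newest timestamp found in the SSTables instead of the
>     // system clock (as DTCS does), falling back to the wall clock when there is no data.
>     static long dataDrivenNow(List<SSTableStat> sstables) {
>         return sstables.stream()
>                        .mapToLong(SSTableStat::maxTimestampMillis)
>                        .max()
>                        .orElse(System.currentTimeMillis());
>     }
>
>     static long currentWindowStart(List<SSTableStat> sstables, int windowSizeDays) {
>         long windowMillis = TimeUnit.DAYS.toMillis(windowSizeDays);
>         return (dataDrivenNow(sstables) / windowMillis) * windowMillis;
>     }
>
>     public static void main(String[] args) {
>         // Data dumped 30 days in the past: the wall clock would put the current window far
>         // ahead of the data, while the data-driven "now" keeps tracking the dump itself.
>         long newestDataMillis = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30);
>         List<SSTableStat> tables = List.of(new SSTableStat("backfill-1", newestDataMillis));
>         System.out.println("current window starts at " + currentWindowStart(tables, 5));
>     }
> }
> {code}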
> The change in behavior is clearly visible in the compaction hierarchy graph
> below. The TWCS modification is visible starting from line 39. See the
> description of the file format in
> https://issues.apache.org/jira/browse/CASSANDRA-9644.
> See file: 20150814_1027_compaction_hierarchy.txt
> The behavior of TWCS looks really promising and also works in practice!
> We would like to propose some ideas for future development of the algorithm.
> 1) The current time window would be determined from the newest timestamp
> found in the SSTables. This allows effective compaction of the SSTables when
> data is fed into the history in chronological order. In the dumping process
> the timestamp of the column is set according to the timestamp of the data sample.
> 2) The number of SSTables participating in one compaction could be limited
> either by the number of files given by max_threshold OR by the total size of
> the files selected for the compaction bucket. The size limit would prevent
> combining large files together, potentially causing an out-of-disk-space
> situation or extremely long-running compaction tasks (see the first sketch
> after this list).
> 3) Currently the time windows are handled one by one starting from the newest,
> which does not lead to the fastest decrease in SSTable count. An alternative
> might be a round-robin approach in which the time windows are stepped through,
> one compaction task is done for the given time window, and then the next time
> window is handled (see the second sketch after this list).
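> A minimal sketch of proposal 2 in plain Java (TableFile and the limits are
> hypothetical): a compaction bucket is trimmed to at most max_threshold files AND
> at most a configured number of input bytes, taking the smallest files first so
> the small SSTables still get merged while a huge file is kept out of an
> over-sized task.
> {code:java}
> import java.util.ArrayList;
> import java.util.Comparator;
> import java.util.List;
>
> // Hypothetical stand-in for an SSTable on disk; not a Cassandra class.
> record TableFile(String name, long sizeBytes) {}
>
> public class BucketTrimSketch {
>     // Trim a compaction bucket to at most maxThreshold files and at most maxBucketBytes
>     // of input, whichever limit is hit first. Smallest files are taken first.
>     static List<TableFile> trimBucket(List<TableFile> bucket, int maxThreshold, long maxBucketBytes) {
>         List<TableFile> sorted = new ArrayList<>(bucket);
>         sorted.sort(Comparator.comparingLong(TableFile::sizeBytes));
>         List<TableFile> selected = new ArrayList<>();
>         long total = 0;
>         for (TableFile f : sorted) {
>             if (selected.size() >= maxThreshold || total + f.sizeBytes() > maxBucketBytes) {
>                 break;
>             }
>             total += f.sizeBytes();
>             selected.add(f);
>         }
>         return selected;
>     }
>
>     public static void main(String[] args) {
>         List<TableFile> bucket = List.of(
>                 new TableFile("big",    200L << 30),  // 200 GB
>                 new TableFile("small1", 40L << 20),   // 40 MB
>                 new TableFile("small2", 50L << 20));  // 50 MB
>         // With a 100 GB byte cap the 200 GB SSTable is left out, avoiding a compaction
>         // that could run for hours or exhaust the remaining disk space.
>         System.out.println(trimBucket(bucket, 32, 100L << 30));
>     }
> }
> {code}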
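> And a minimal sketch of proposal 3 (the Window interface and the dummy windows
> are hypothetical): one pass submits a single compaction task per time window that
> still has work, so every window makes progress and the total SSTable count drops
> faster than when the newest window is drained completely first.
> {code:java}
> import java.util.List;
>
> public class RoundRobinWindowsSketch {
>     // Hypothetical view of a time window; not a Cassandra interface.
>     interface Window {
>         boolean hasPendingCompaction();
>         void submitOneCompaction(); // submit a single compaction task for this window
>     }
>
>     // One round-robin pass: each window with pending work gets exactly one task,
>     // instead of draining the newest window completely before moving on.
>     static void runOnePass(List<Window> windowsNewestFirst) {
>         for (Window w : windowsNewestFirst) {
>             if (w.hasPendingCompaction()) {
>                 w.submitOneCompaction();
>             }
>         }
>     }
>
>     public static void main(String[] args) {
>         Window recent = dummy("2015-08-10 window", 3);
>         Window older  = dummy("2015-08-05 window", 7);
>         runOnePass(List.of(recent, older)); // both windows make progress in a single pass
>     }
>
>     // Dummy window that just logs the task it would submit.
>     static Window dummy(String name, int pendingTasks) {
>         return new Window() {
>             int left = pendingTasks;
>             public boolean hasPendingCompaction() { return left > 0; }
>             public void submitOneCompaction() { left--; System.out.println("compacting in " + name); }
>         };
>     }
> }
> {code}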
> Side note: while observing the compaction process, it appeared that compaction
> intermittently used two threads. However, sometimes during a long-running
> compaction task (hours) another thread did not kick in to work on the small
> SSTables, even though thousands of them were available for compaction.
>
>