[ 
https://issues.apache.org/jira/browse/CASSANDRA-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-7871:
---------------------------------------
    Fix Version/s:     (was: 2.0.13)
                   3.0

> Reduce compaction IO in LCS 
> ----------------------------
>
>                 Key: CASSANDRA-7871
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7871
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Dan Hendry
>            Assignee: Dan Hendry
>             Fix For: 3.0
>
>         Attachments: LeveledCompactionImprovement-2.0.10.patch, 
> experiment.png, levelmultiplier.png, sstablesize.png
>
>
> I have found LCS to be superior to STCS in almost every way - except for the
> fact that it requires significantly more IO (a well-advertised property). In
> leveled compaction, L~n+1~ is 10 times larger than L~n~, so generally 1+10
> sstables need to be compacted to promote one sstable into the next level. For
> certain workloads, this practically means only 1/(10+1)≈9% of the IO,
> specifically write IO, is doing ‘useful’ work.
> But why is each level 10 times larger? Why 10? It's a pretty-looking number
> and all, but that's not a very good reason to choose it. If we chose 5 or even
> 2 we could reduce the ‘wasted’ IO required to promote an sstable to the next
> level - of course at the expense of requiring more levels. I have not been
> able to find justification for this choice in either Cassandra or LevelDB
> itself. I would like to introduce a new parameter, the leveling multiplier,
> which controls the desired size ratio between L~n~ and L~n+1~.
> First and foremost, a little math. Let's assume we have a CF of a fixed size
> that is receiving continuous new data (i.e. data is expiring due to TTLs or is
> being overwritten). I believe the number of levels required is approximately
> (see note 1):
> {noformat}data size = (sstable size)*(leveling multiplier)^(level count){noformat}
> Which, when solved for the level count, becomes:
> {noformat}level count = log((data size)/(sstable size))/log(leveling multiplier){noformat}
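> For example, plugging in the current defaults of a 160 MB sstable size and a
> leveling multiplier of 10, together with the 169 GB expected dataset used in
> the experiment below:
> {noformat}level count = log((169*1024)/160)/log(10) ≈ 3.0{noformat}
> which matches the predicted level count for node A in the results table.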
> The amount of compaction write IO required over the lifetime of a particular
> piece of data (excluding compactions in L0) is:
> {noformat}write IO = (flush IO) + (promotion IO)*(level count)
> write IO = 1 + (1 + (leveling multiplier))*log((data size)/(sstable size))/log(leveling multiplier){noformat}
> So ultimately, the relationship between write IO and the leveling multiplier
> is f\(x) = (1 + x)/log\(x), which is minimized at x ≈ 3.59, or 4 if we round
> to the nearest integer. Also note that write IO is proportional to log((data
> size)/(sstable size)), which suggests using larger sstables would also reduce
> disk IO.
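> As a quick numerical sanity check of that optimum (a throwaway Python sketch
> of my own, not part of the attached patch):
> {noformat}import math
> def cost_factor(m):
>     # write IO grows roughly as (1 + m)/log(m) in the leveling multiplier m
>     return (1 + m) / math.log(m)
> # scan candidate multipliers between 2 and 10 and pick the cheapest
> best = min([x / 100.0 for x in range(200, 1001)], key=cost_factor)
> print(best)  # ~3.59, matching the optimum quoted above{noformat}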
> As one final analytical step, we can add the following term to approximate
> STCS in L0 (which is not actually how it is implemented, but should be close
> enough for moderate sstable sizes):
> {noformat}L0 write IO = max(0, floor(log((sstable size)/(flush size))/log(4))){noformat}
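> To make the combined model concrete, here is a small sketch (my own
> illustration - the function name and the 160 MB flush-size default are
> assumptions for illustration, not values taken from the patch):
> {noformat}import math
> def predicted_write_io(data_mb, sstable_mb, multiplier, flush_mb=160.0):
>     # approximate number of LCS levels needed to hold the dataset
>     levels = math.log(data_mb / sstable_mb) / math.log(multiplier)
>     # one flush write, then (1 + multiplier) writes per level promotion
>     lcs_io = 1 + (1 + multiplier) * levels
>     # rough STCS-style cost of building each sstable up in L0
>     l0_io = max(0, math.floor(math.log(sstable_mb / flush_mb) / math.log(4)))
>     return lcs_io + l0_io{noformat}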
> The following two graphs illustrate the predicted compaction requirements as 
> a function of the leveling multiplier and sstable size:
> !levelmultiplier.png!!sstablesize.png!
> In terms of empirically verifying the expected results, I set up three
> Cassandra nodes: node A having a leveling multiplier of 10 and an sstable size
> of 160 MB (current Cassandra defaults), node B with multiplier 4 and size 160
> MB, and node C with multiplier 4 and size 1024 MB. I used a simple write-only
> workload which inserted data having a TTL of 2 days at 1 MB/second (see note
> 2). Compaction throttling was disabled and gc_grace was 60 seconds. All nodes
> had dedicated data disks and IO measurements were for the data disks only.
> !experiment.png!
> ||Measure||Node A (10, 160MB)||Node B (4, 160MB)||Node C (4, 1024MB)||
> |Predicted IO Rate|34.4 MB/s|26.2 MB/s|20.5 MB/s|
> |Predicted Improvement|n/a|23.8%|40.4%|
> |Predicted Number of Levels (Expected Dataset of 169 GB)|3.0|5.0|3.7|
> |Experimental IO Rate|32.0 MB/s|28.0 MB/s|20.4 MB/s|
> |Experimental Improvement|n/a|12.4%|*36.3%*|
> |Experimental Number of Levels|~4.1|~6.1|~4.8|
> |Final Dataset Size (After 88 hours)|301 GB|261 GB|258 GB|
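> For reference, the predicted rows above can be reproduced with the
> predicted_write_io sketch from earlier, scaled by the 1 MB/s insert rate
> (again, the 160 MB flush size is my assumption, not a measured value):
> {noformat}data_mb = 169 * 1024  # expected steady-state dataset of 169 GB
> for name, multiplier, sstable_mb in [("A", 10, 160), ("B", 4, 160), ("C", 4, 1024)]:
>     # prints roughly 34.4, 26.2 and 20.5, matching the predicted IO rates
>     print(name, round(predicted_write_io(data_mb, sstable_mb, multiplier), 1)){noformat}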
> These results indicate that node A performed better than expected. I suspect
> this was because the data insertion rate was a little too high and compaction
> periodically got backlogged, meaning the promotion from L0 to L1 was more
> efficient. Also note that the actual dataset size is larger than that used in
> the analytical model - which is expected, as expired data will not get purged
> immediately. The size difference between node A and the others, however,
> seems suspicious to me.
> In summary, these results, both theoretical and experimental, clearly
> indicate that reducing the leveling multiplier from 10 to 4 and increasing
> the sstable size reduces compaction IO. The experimental results, using an
> sstable size of 1024 MB and a leveling multiplier of 4, demonstrated a 36%
> reduction in write IO without a significant increase in the number of levels.
> I have not run benchmarks for an update-heavy workload, but I suspect it
> would benefit significantly since more data can be ‘updated’ per compaction.
> I have also not benchmarked read performance, but I would not expect
> noticeable degradation provided an sstable size is chosen which keeps the
> number of levels roughly equal.
> The patch I have attached is against 2.0.10 and does not change the defaults.
> Long term, however, it would make sense to move to more optimal defaults
> unless there is compelling counter-evidence to the performance gains observed.
> One final observation: in current leveled compaction the number of levels is
> determined by the amount of data and the user-specified sstable size. A
> compaction strategy where the user instead selects the desired number of
> levels and the strategy adjusts the sstable size based on the amount of data
> would have a number of benefits. The strategy would behave more consistently
> across a much wider range of dataset sizes. Compaction IO overhead (as a
> function of write rate) and worst-case read performance (number of sstables
> per read) would both be largely independent of dataset size.
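> A rough sketch of what such a strategy might compute (purely illustrative,
> not part of the attached patch):
> {noformat}def target_sstable_mb(data_mb, multiplier, desired_levels):
>     # invert: data size = (sstable size)*(leveling multiplier)^(level count)
>     return data_mb / multiplier ** desired_levels
> # e.g. holding 4 levels with a multiplier of 4, a 169 GB CF would target
> # roughly 676 MB sstables, and the target grows as the dataset grows
> print(round(target_sstable_mb(169 * 1024, 4, 4)))  # -> 676{noformat}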
> Note 1: This equation only calculates the amount of data able to fit in the
> largest level. It would be more accurate to take into account the data in the
> smaller levels (i.e. using the geometric series equation), but this is a close
> enough approximation. There is also the fact that redundant data might be
> spread across the various levels.
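> (For completeness, the geometric-series form of note 1 would be roughly
> {noformat}data size ≈ (sstable size)*(leveling multiplier)^(level count)*(multiplier/(multiplier - 1)){noformat}
> which only shifts the result by a factor of about m/(m-1), i.e. roughly 11%
> for a multiplier of 10 and 33% for a multiplier of 4 - my own restatement,
> not something verified experimentally.)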
> Note 2: This represents the entropy introduction rate and does not account
> for any Cassandra overhead; compression was also enabled. The row key was a
> long, each row had 512 columns, the column name was a UUID, and the column
> value was a 64 byte blob.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
