Dan Hendry created CASSANDRA-7871:
-------------------------------------

             Summary: Reduce compaction IO in LCS 
                 Key: CASSANDRA-7871
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7871
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Dan Hendry
         Attachments: experiment.png, levelmultiplier.png, sstablesize.png

I have found LCS to be superior to STCS in almost every way - except for the 
fact that it requires significantly more IO (a well advertised property). In 
leveled compaction, L~n+1~ is 10 times larger than L~n~, so generally 1+10 
sstables need to be compacted to promote one sstable into the next level. For 
certain workloads, this practically means only 1/(10+1) ≈ 9% of the IO, 
specifically write IO, is doing ‘useful’ work. 

But why is each level 10 times larger? Why 10? It's a pretty looking number and 
all, but that's not a very good reason to choose it. If we chose 5 or even 2 we 
could reduce the ‘wasted’ IO required to promote an sstable to the next level - 
of course at the expense of requiring more levels. I have not been able to find 
justification for this choice in either cassandra or leveldb itself. I would 
like to introduce a new parameter, the leveling multiplier, which controls the 
desired size difference between L~n~ and L~n+1~.

First and foremost, a little math. Let's assume we have a CF of a fixed size 
that is receiving continuous new data (i.e. data is expiring due to TTLs or is 
being overwritten). I believe the number of levels required is approximately 
(see note 1):

{noformat}data size = (sstable size)*(leveling multiplier)^(level count){noformat}

Which, when solving for the level count, becomes:

{noformat}level count = log((data size)/(sstable size))/log(leveling multiplier){noformat}

The amount of compaction write IO required over the lifetime of a particular 
piece of data (excluding compactions in L0) is as follows, where (flush IO) = 1 
write per byte and (promotion IO) = 1 + (leveling multiplier) writes per byte, 
since promoting one sstable rewrites it together with the roughly 
(leveling multiplier) overlapping sstables in the next level:

{noformat}write IO = (flush IO) + (promotion IO)*(level count)
write IO = 1 + (1 + (leveling multiplier))*log((data size)/(sstable size))/log(leveling multiplier){noformat}

So ultimately, the relationship between write IO and the leveling multiplier 
is f\(x) = (1 + x)/log\(x), which is minimized at x ≈ 3.59, or 4 if we round to 
the nearest integer. Also note that write IO is proportional to log((data 
size)/(sstable size)), which suggests using larger sstables would also reduce 
disk IO.

As one final analytical step we can add the following term to approximate STCS 
in L0 (which is not actually how it is implemented, but should be close enough 
for moderate sstable sizes):

{noformat}L0 write IO = max(0, floor(log((sstable size)/(flush size))/log(4))){noformat}
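Putting the pieces together, a small Python sketch of the full predicted model 
(levels plus the L0 term). The flush size is not specified above, so the 128 MB 
used here is an assumption for illustration only; with it, the three node 
configurations used in the experiment below come out close to the predicted 
rates in the results table:

{noformat}
import math

def predicted_write_io(data_mb, sstable_mb, multiplier, flush_mb=128):
    # Levels needed so that the largest level can hold the whole dataset.
    levels = math.log(data_mb / sstable_mb) / math.log(multiplier)
    # One write at flush time plus (1 + multiplier) writes per promotion.
    leveled = 1 + (1 + multiplier) * levels
    # Approximate size-tiered behaviour in L0; flush_mb = 128 is an assumed
    # memtable flush size, not a figure from the experiment.
    l0 = max(0, math.floor(math.log(sstable_mb / flush_mb) / math.log(4)))
    return leveled + l0

data_mb = 169 * 1024  # expected steady-state dataset of ~169 GB
for name, multiplier, sstable_mb in [("A", 10, 160), ("B", 4, 160), ("C", 4, 1024)]:
    rate = predicted_write_io(data_mb, sstable_mb, multiplier)
    print("node %s: ~%.1f MB written per MB inserted" % (name, rate))
{noformat}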

The following two graphs illustrate the predicted compaction requirements as a 
function of the leveling multiplier and sstable size:

!levelmultiplier.png!

!sstablesize.png!

In terms of empirically verifying the expected results, I set up three 
cassandra nodes: node A having a leveling multiplier of 10 and an sstable size 
of 160 MB (current cassandra defaults), node B with multiplier 4 and size 160 
MB, and node C with multiplier 4 and size 1024 MB. I used a simple write-only 
workload which inserted data having a TTL of 2 days at 1 MB/second (see note 
2). Compaction throttling was disabled and gc_grace was 60 seconds. All nodes 
had dedicated data disks and IO measurements were for the data disks only.

!experiment.png!

||Measure||Node A (multiplier 10, 160 MB)||Node B (multiplier 4, 160 MB)||Node C (multiplier 4, 1024 MB)||
|Predicted IO Rate|34.4 MB/s|26.2 MB/s|20.5 MB/s|
|Predicted Improvement|n/a|23.8%|40.4%|
|Predicted Number of Levels (Expected Dataset of 169 GB)|3.0|5.0|3.7|
|Experimental IO Rate|32.0 MB/s|28.0 MB/s|20.4 MB/s|
|Experimental Improvement|n/a|12.4%|*36.3%*|
|Experimental Number of Levels|~4.1|~6.1|~4.8|
|Final Dataset Size (After 88 hours)|301 GB|261 GB|258 GB|

These results indicate that Node A performed better than expected. I suspect 
that this was because the data insertion rate was a little too high and 
compaction periodically got backlogged, meaning the promotion from L0 to L1 was 
more efficient. Also note that the actual dataset size is larger than that used 
in the analytical model - which is expected, as expired data will not get 
purged immediately. The size difference between node A and the others, however, 
seems suspicious to me.

In summary, these results, both theoretical and experimental, clearly indicate 
that reducing the leveling multiplier from 10 to 4 and increasing the sstable 
size reduces compaction IO. The experimental results, using an SSTable size of 
1024 MB and a leveling multiplier of 4, demonstrated a 36% reduction in write 
IO without a significant increase in the number of levels. I have not run 
benchmarks for an update-heavy workload but I suspect it would benefit 
significantly since more data can be ‘updated’ per compaction. I have also not 
benchmarked read performance, but I would not expect noticeable degradation 
provided an sstable size is chosen which keeps the number of levels roughly 
unchanged.

The patch I have attached is against 2.0.10 and does not change the defaults. 
Long term, however, it would make sense to use more optimal defaults unless 
there is compelling counter-evidence to the performance gains observed here.

One final observation: in current leveled compaction the number of levels is 
determined by the amount of data and the user-specified sstable size. A 
compaction strategy where the user instead selects the desired number of 
levels and the strategy adjusts the SSTable size based on the amount of data 
would have a number of benefits. The strategy would behave more consistently 
across a much wider range of dataset sizes. Compaction IO overhead (as a 
function of write rate) and worst-case read performance (number of sstables per 
read) would both be largely independent of dataset size.
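To make that idea concrete, a rough sketch (again purely illustrative, not part 
of the attached patch) of how such a strategy might pick the sstable size by 
inverting the size equation from above; the min_mb floor is a hypothetical 
parameter to avoid absurdly small sstables:

{noformat}
def sstable_size_for_levels(data_mb, multiplier, desired_levels, min_mb=160):
    # Invert: data size = (sstable size)*(leveling multiplier)^(level count),
    # so that the chosen number of levels is just enough to hold the dataset.
    # min_mb is a hypothetical lower bound, not something from the patch.
    return max(min_mb, data_mb / multiplier ** desired_levels)

# e.g. a 169 GB dataset, multiplier 4, targeting 4 levels -> ~676 MB sstables
print("%.0f MB" % sstable_size_for_levels(169 * 1024, 4, 4))
{noformat}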


Note 1: This equation only calculates the amount of data able to fit in the 
largest level. It would be more accurate to take into account the data in the 
smaller levels (i.e. using the geometric series formula) but this is a close 
enough approximation. There is also the fact that redundant data might be 
spread across the various levels.
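For completeness, the geometric-series version alluded to above would be 
(summing L0 through the largest level; not used in the model):

{noformat}total data size = (sstable size)*((leveling multiplier)^(level count + 1) - 1)/((leveling multiplier) - 1){noformat}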

Note 2: This represents the entropy introduction rate and does not account for 
any Cassandra overhead; compression was also enabled. The row key was a long, 
each row had 512 columns, the column name was a UUID, and the column value was 
a 64 byte blob.



