[ 
https://issues.apache.org/jira/browse/CASSANDRA-10540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445374#comment-16445374
 ] 

Lerh Chuan Low edited comment on CASSANDRA-10540 at 4/20/18 6:38 AM:
---------------------------------------------------------------------

Hey Marcus, thanks for getting back so quickly :)

The big motivation for us is repairs: when vnodes are turned on, every SSTable 
contains data from many vnode ranges, so when an (incremental) repair happens, 
the range it is interested in gets anticompacted out, then the next range gets 
anticompacted out, and so on. RACS solves that big pain. 

Besides making the SSTables-per-read numbers much nicer, it's like an LCS on 
steroids. I think there are other benefits of restricting each SSTable to one 
token range, but I can't quite remember any more off the top of my head. 

So I am hoping it doesn't come to grouping the vnodes....unless it's a last 
resort. 

Currently it looks like you create a RACS for each of the 
repaired/unrepaired/pending-repair sets, and each RACS keeps track of the 
compaction strategies it is in charge of (which are all of the same class). The 
CS instances are lazily initialized until needed (so that's a win right there). 
It seems that the reason we want so many CS instances is so that each of them 
can keep track of its own SSTables (which all belong to that single token 
range). 
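
Just to check that I'm reading the lazy part right, I picture it roughly like 
this (a sketch with made-up names and types, not actual code from the patch):

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class LazyPerRangeStrategies<S>
{
    // rangeIndex -> lazily created wrapped strategy (e.g. an LCS or STCS instance)
    private final Map<Integer, S> strategies = new HashMap<>();
    private final Supplier<S> factory;

    LazyPerRangeStrategies(Supplier<S> factory)
    {
        this.factory = factory;
    }

    // Only ranges that actually see SSTables ever get a strategy instance.
    S getOrCreate(int rangeIndex)
    {
        return strategies.computeIfAbsent(rangeIndex, i -> factory.get());
    }

    int instantiatedCount()
    {
        return strategies.size();
    }
}
{code}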

How about if RACS doesn't instantiate the individual CS instances at all? It 
would keep track of all the SSTables in the CF like other CS instances do; only 
the logic for choosing which SSTables to involve in a compaction would live in 
RACS. RACS could check L0 first, and if there is nothing to do there, L1 would 
involve grouping the SSTables by range, asking the underlying/wrapped CS for 
the next background task for each group, and submitting those tasks. 
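
Something roughly like this is what I have in mind for the grouping step 
(again, just a sketch with made-up names, not actual code from the patch):

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RangeGroupingSketch
{
    // Maps an SSTable to the vnode range it belongs to; how that index is
    // computed is up to RACS.
    interface RangeMapper<T>
    {
        int rangeIndexOf(T sstable);
    }

    // Group the strategy's full SSTable set by vnode range at decision time,
    // instead of keeping one wrapped strategy instance per range.
    static <T> Map<Integer, List<T>> groupByRange(Iterable<T> sstables, RangeMapper<T> mapper)
    {
        Map<Integer, List<T>> groups = new HashMap<>();
        for (T sstable : sstables)
            groups.computeIfAbsent(mapper.rangeIndexOf(sstable), i -> new ArrayList<>()).add(sstable);
        return groups;
    }
}
{code}

For each non-empty group, RACS would then ask the single wrapped CS for its 
next background task restricted to that group's SSTables and submit it.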

The downside of doing it this way is recalculating the grouping each time we 
ask for the next background task. We could also store it in memory in the form 
of a manifest like in LCS: an array with the SSTables belonging to each range. 
That beats having 256 instances, but we're still going to have a 256-sized 
array in memory, I guess. It just seems so strikingly similar to an LCS 
restricted to just L0 and L1. 
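
So the manifest would basically be something like this (hypothetical sketch, 
names made up):

{code:java}
import java.util.ArrayList;
import java.util.List;

final class RangeManifestSketch<T>
{
    // One slot per vnode range, like levels in the LCS manifest; with 256 vnodes
    // this is a 256-entry array of lists rather than 256 strategy objects.
    private final List<List<T>> sstablesPerRange;

    RangeManifestSketch(int numRanges)
    {
        sstablesPerRange = new ArrayList<>(numRanges);
        for (int i = 0; i < numRanges; i++)
            sstablesPerRange.add(new ArrayList<>());
    }

    void add(int rangeIndex, T sstable)
    {
        sstablesPerRange.get(rangeIndex).add(sstable);
    }

    void remove(int rangeIndex, T sstable)
    {
        sstablesPerRange.get(rangeIndex).remove(sstable);
    }

    List<T> sstablesIn(int rangeIndex)
    {
        return sstablesPerRange.get(rangeIndex);
    }
}
{code}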

A final thought: is the memory footprint actually significant enough that we'd 
rather further group the vnodes than bite the bullet, given that the gains from 
having each SSTable cover a single range are large, simple is a feature, and 
RACS is customizable? 

Please excuse my ignorance if none of those suggestions made sense or would 
work; I'm still not very confident with the code base.
  
(Btw also feel free to let me know if you would like a hand with anything)



> RangeAwareCompaction
> --------------------
>
>                 Key: CASSANDRA-10540
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10540
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Marcus Eriksson
>            Assignee: Marcus Eriksson
>            Priority: Major
>              Labels: compaction, lcs, vnodes
>             Fix For: 4.x
>
>
> Broken out from CASSANDRA-6696, we should split sstables based on ranges 
> during compaction.
> Requirements;
> * don't create tiny sstables - keep them bunched together until a single vnode 
> is big enough (configurable how big that is)
> * make it possible to run existing compaction strategies on the per-range 
> sstables
> We should probably add a global compaction strategy parameter that states 
> whether this should be enabled or not.


